To begin the analysis, I loaded the project data and the relevant R packages. I also developed a custom function, custom_kable, for constructing clean, aesthetically pleasing tables, and set my ggplot2 theme to theme_bw, which is my preference. With that done, I have all the building blocks to begin the project.
library(dplyr)
library(ggplot2)
library(ggfortify)
library(GGally)
library(reshape2)
library(corrplot)
library(zoo)
library(rgl)
library(knitr)
library(kableExtra)
treasury_data <- read.csv("regression_final_project.csv")
#setting ggplot preference
theme_set(
theme_bw()
)
#creating custom table function
custom_kable <- function(x){
kable(x, format = "html") %>%
kable_styling(bootstrap_options = "striped")
}
To start, I did a quick review of the data. As the chart below shows, there are 11 variables, most of which are numeric. These are interest rates for various US Treasury yields to maturity, ranging from 3 months to 30 years. The other variables are the date (i.e. the day each Treasury bond interest rate was observed) and categorical markers indicating whether the period was one where the Fed was easing or tightening monetary policy. Finally, there is an anonymized output variable, which serves as the project's outcome focus. Right now, it's unclear what this variable entails, but the project will work towards uncovering it.
data.frame(
variable = names(treasury_data),
data.type = sapply(treasury_data, typeof),
values = sapply(treasury_data, function(x) paste0(head(x), collapse = ", ")),
row.names = NULL) %>%
custom_kable()
| variable | data.type | values |
|---|---|---|
| Date | integer | 1/5/1981, 1/6/1981, 1/7/1981, 1/8/1981, 1/9/1981, 1/12/1981 |
| USGG3M | double | 13.52, 13.58, 14.5, 14.76, 15.2, 15.22 |
| USGG6M | double | 13.09, 13.16, 13.9, 14, 14.3, 14.23 |
| USGG2YR | double | 12.289, 12.429, 12.929, 13.099, 13.539, 13.179 |
| USGG3YR | double | 12.28, 12.31, 12.78, 12.95, 13.28, 12.94 |
| USGG5YR | double | 12.294, 12.214, 12.614, 12.684, 12.884, 12.714 |
| USGG10YR | double | 12.152, 12.112, 12.382, 12.352, 12.572, 12.452 |
| USGG30YR | double | 11.672, 11.672, 11.892, 11.912, 12.132, 12.082 |
| Output1 | double | 18.01552579, 18.09139801, 19.44731416, 19.74851024, 20.57204231, 20.1421849 |
| Easing | integer | NA, NA, NA, NA, NA, NA |
| Tightening | integer | NA, NA, NA, NA, NA, NA |
As another quick glance at the data, the first 6 rows of each variable can be seen below. This further reinforces that both Easing and Tightening have numerous NAs.
head(treasury_data) %>%
custom_kable
| Date | USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR | Output1 | Easing | Tightening |
|---|---|---|---|---|---|---|---|---|---|---|
| 1/5/1981 | 13.52 | 13.09 | 12.289 | 12.28 | 12.294 | 12.152 | 11.672 | 18.01553 | NA | NA |
| 1/6/1981 | 13.58 | 13.16 | 12.429 | 12.31 | 12.214 | 12.112 | 11.672 | 18.09140 | NA | NA |
| 1/7/1981 | 14.50 | 13.90 | 12.929 | 12.78 | 12.614 | 12.382 | 11.892 | 19.44731 | NA | NA |
| 1/8/1981 | 14.76 | 14.00 | 13.099 | 12.95 | 12.684 | 12.352 | 11.912 | 19.74851 | NA | NA |
| 1/9/1981 | 15.20 | 14.30 | 13.539 | 13.28 | 12.884 | 12.572 | 12.132 | 20.57204 | NA | NA |
| 1/12/1981 | 15.22 | 14.23 | 13.179 | 12.94 | 12.714 | 12.452 | 12.082 | 20.14218 | NA | NA |
Following the first data review, I am going to make one transformation: changing Date from a factor to a proper Date object.
treasury_data <- treasury_data %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"))
Moving into some analysis, I've plotted all the Treasury yields from their first to last date. The chart highlights that these yields have declined steadily across all categories since 1981. It's interesting to review the plot and match the lines to historical financial events, such as the 2008 market crash stemming from the sub-prime mortgage crisis. The general cyclical nature of the economy can also be seen in the graphic. As a note, there is a gap in the data from late 2007 to late 2008 due to missing yield values, which I've marked with the shaded region between the vertical blue lines.
bond_trend <- treasury_data %>%
select(-Easing, -Tightening) %>%
melt(value.name = "interest.rate",
variable.name = "treasury.yield",
id.vars = "Date")
bond_trend %>%
filter(treasury.yield != "Output1") %>%
ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
geom_line(size = 1.3, alpha = .3) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
geom_vline(xintercept = as.Date(c("2007-11-29", "2008-11-11")),
colour = "royalblue2", size = 1.3, alpha = .5) +
annotate("rect", xmin = as.Date("2007-11-29"), xmax = as.Date("2008-11-11"),
ymin = 0, ymax = max(treasury_data$USGG3M),
alpha = .1, fill = "dodgerblue2") +
scale_y_continuous(breaks = seq(0, 20, 2)) +
labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
subtitle = "Vertical blue lines from 2007 to 2008 denote a gap in yield rates as a result of missing data",
caption = "Source: Course Project Treasury Data")
As an added feature, I've also included a summary table for each decade. I developed the decade feature from Date and then grouped the melted data frame used for the plot above. From this, the aggregate trend for each decade can be seen. I've included a range of statistics, but the table is sorted by yield mean. As gleaned from the chart, yield rates have declined steadily since the 1980s. Interestingly, the 1980s also had the largest range, from about 5 to 17; this spread is further reflected in that decade having the highest standard deviation. The aforementioned 2008 financial crisis is evident again where the yields dropped slightly below zero on some dates.
bond_trend %>%
mutate(decade.marker = substr(as.character(Date), 1, 3),
decade = case_when(
decade.marker == "198" ~ "1980s",
decade.marker == "199" ~ "1990s",
decade.marker == "200" ~ "2000s",
decade.marker == "201" ~ "2010s")) %>%
filter(treasury.yield != "Output1") %>%
group_by(decade) %>%
summarise(yield.mean = mean(interest.rate),
yield.median = median(interest.rate),
yield.sd = sd(interest.rate),
yield.min = min(interest.rate),
yield.max = max(interest.rate)) %>%
arrange(desc(yield.mean)) %>%
custom_kable()
| decade | yield.mean | yield.median | yield.sd | yield.min | yield.max |
|---|---|---|---|---|---|
| 1980s | 9.837335 | 9.19800 | 2.578732 | 5.1210 | 17.0100 |
| 1990s | 5.972553 | 5.87700 | 1.326211 | 2.6480 | 9.1720 |
| 2000s | 3.747912 | 4.17980 | 1.639327 | -0.0410 | 6.9080 |
| 2010s | 1.259734 | 0.68025 | 1.324698 | -0.0152 | 4.8395 |
The plots can also be viewed using facets, which makes each yield period clearer.
bond_trend %>%
filter(treasury.yield != "Output1") %>%
ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
geom_line(size = 1.3) +
facet_wrap(facets = "treasury.yield", nrow = 2) +
theme(legend.position = "none") +
labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
caption = "Source: Course Project Treasury Data")
So far, the plots have only included the Treasury interest rates for the various bond types; the still unknown outcome variable hasn't been shown alongside them. To see how the predictor variables interact with the output, the plot below includes all trajectories. As seen, the outcome variable has been more erratic than the predictors. Even so, at first glance it appears highly correlated with the input variables, despite some scale differences. Of note, the variable drops well below zero on many dates, including most of the 2000s and all of the 2010s.
bond_trend %>%
ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
geom_line(size = 1.3, alpha = .4) +
scale_y_continuous(breaks = seq(-20, 30, 5)) +
geom_hline(yintercept = 0, colour = "royalblue2", size = 1.3, alpha = .3) +
guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1))) +
labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
subtitle = "Output1 has been more volatile than predictors in given timeframe -- variable identity still unknown",
caption = "Source: Course Project Treasury Data",
y = "interest rate (and unknown output values)")
The table below is similar to the yield summary for the predictors, but the output's extremes are evident. The values are much higher than the yields in the 1980s but drop well below them starting in the early 1990s. While the variable remains shrouded in mystery, the plot and table highlight some attributes of how it behaves. Again, despite some scaling variation, the output and inputs seem strongly correlated.
bond_trend %>%
mutate(decade.marker = substr(as.character(Date), 1, 3),
decade = case_when(
decade.marker == "198" ~ "1980s",
decade.marker == "199" ~ "1990s",
decade.marker == "200" ~ "2000s",
decade.marker == "201" ~ "2010s")) %>%
filter(treasury.yield == "Output1") %>%
group_by(decade) %>%
summarise(output.mean = mean(interest.rate),
output.median = median(interest.rate),
output.sd = sd(interest.rate),
output.min = min(interest.rate),
output.max = max(interest.rate)) %>%
arrange(desc(output.mean)) %>%
custom_kable()
| decade | output.mean | output.median | output.sd | output.min | output.max |
|---|---|---|---|---|---|
| 1980s | 10.9393048 | 9.3803598 | 6.4622746 | 1.227457 | 27.297880 |
| 1990s | 0.7140675 | 0.4043963 | 2.7735960 | -4.166962 | 8.384047 |
| 2000s | -5.1761605 | -5.3571968 | 3.5944082 | -12.208100 | 2.422812 |
| 2010s | -11.8650363 | -11.8838808 | 0.8610511 | -13.173111 | -9.634017 |
To check the formal relationship between the inputs and the output before getting into the linear modelling, I constructed a correlation visualization. As seen, the correlations are extremely high. In fact, they're perfect in some cases, as with USGG2YR to USGG5YR, and nearly perfect in the remaining few. This is a strong indication that the output variable might reasonably be some derivation of the inputs. That said, what that derivation might be is still very unclear, and there's more work to do to uncover what the output actually is.
corplot_df <- treasury_data %>%
select(USGG3M, USGG6M, USGG2YR, USGG3YR, USGG5YR, USGG10YR, USGG30YR, Output1) %>%
rename(three.m = USGG3M,
six.m = USGG6M,
two.yr = USGG2YR,
three.yr = USGG3YR,
five.yr = USGG5YR,
ten.yr = USGG10YR,
thirty.yr = USGG30YR,
output = Output1)
corplot_df <- cor(corplot_df)
corrplot.mixed(corplot_df,
title = "Correlation plot for Treasury Data- Output shows almost perfect correlation to inputs", mar = c(0, 0, 2, 0))
Following the initial exploratory work, I'll begin the modelling phase. To start, I'll develop seven individual linear models, one for each Treasury yield category versus the output.
linear3M <- lm(Output1 ~ USGG3M, data = treasury_data)
linear6M <- lm(Output1 ~ USGG6M, data = treasury_data)
linear2YR <- lm(Output1 ~ USGG2YR, data = treasury_data)
linear3YR <- lm(Output1 ~ USGG3YR, data = treasury_data)
linear5YR <- lm(Output1 ~ USGG5YR, data = treasury_data)
linear10YR <- lm(Output1 ~ USGG10YR, data = treasury_data)
linear30YR <- lm(Output1 ~ USGG30YR, data = treasury_data)
Starting with the first model, which focuses on the 3-month yield, it's clear that the simple linear regressions should be very significant. There were clear signs of this during the correlation work, but the model summary helps solidify that intuition. Here, the model has a highly significant slope, indicating the alternative hypothesis, that there is a relationship between the input and output variables, can be accepted.
The slope estimate is 2.50756, which means that for every one unit increase in the 3-month yield, the expected output goes up by that amount. In real terms, a one point rise in the bond rate corresponds to an expected increase of about 2.5 in the output. The -11.72318 intercept is less interesting in any pragmatic sense, given that interest rates cannot go below zero in the US and the output is unknown. Nonetheless, it indicates that at a yield of zero, the expected output would sit at that value. Both the slope and intercept are significant as well. Broadly speaking, this means there is a verifiable relationship between the 3-month bond interest rate and the output. Assuming the output is of some significance to a business or government audience, this relationship could be very useful for understanding market trends or possibly forecasting. That said, with the output still unknown, it's hard to draw many conclusions about the wider applicability of the model. Still, there are signs that it might have fruitful applications.
summary(linear3M)
##
## Call:
## lm(formula = Output1 ~ USGG3M, data = treasury_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9374 -1.2115 -0.0528 1.2640 7.7048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.72318 0.03137 -373.7 <2e-16 ***
## USGG3M 2.50756 0.00541 463.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.69 on 8298 degrees of freedom
## Multiple R-squared: 0.9628, Adjusted R-squared: 0.9628
## F-statistic: 2.148e+05 on 1 and 8298 DF, p-value: < 2.2e-16
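As a quick, hypothetical worked example of reading these coefficients (using the estimates from the summary above; in practice predict() would be used on the fitted model), a fitted value can be reconstructed by hand:

```r
# Reconstructing a fitted value from the reported coefficients:
# predicted Output1 = intercept + slope * USGG3M
intercept <- -11.72318
slope <- 2.50756

# e.g. the expected output when the 3-month yield is 5%:
intercept + slope * 5   # 0.81462
```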
To better review this relationship, I've included a scatterplot of the variables. The significant association is on display here as a very strong positive relationship, further accentuated by a regression line. Of note, I've included the standard error band, although it is so narrow that it isn't visible.
Perhaps more interestingly, the points show an uneven, clustered pattern. This is a clue that the relationship differs depending on the value of each variable. Additionally, this almost guarantees that the residuals from the regression are unequal, since points at the top follow a different pattern than those in the bottom portions. In fact, many sections seem very uneven. Again, I feel this points to a non-traditional "output", one that is in some way linked to the interest rate predictors (if not wholly derived from them).
treasury_data %>%
ggplot(aes(USGG3M, Output1)) +
geom_jitter(colour = "darkgray", alpha = .5) +
geom_smooth(method = "lm", size = 1.3) +
labs(title = "Output1 vs USGG3M interest rate- variables have a very significant relationship",
subtitle = "Scatterplot pattern displays clustered points which are unevenly distributed",
caption = "Source: Course Project Treasury Data")
The plot raises another point. Given that the models are likely quite homogeneous, owing to similarly high correlation coefficients and plot trends, I decided to review them in unison. This helps highlight model comparisons while centralizing the model coefficients in one place. Additionally, I wanted to try making a few custom functions to capture the various model outputs. To this end, I started by collecting the model variances.
This is a crucial step in the analysis. Collecting and reviewing how much variance a model explains is central to inferential statistics. To do this, there are a few variance related calculations that need to be made. First, I started with total variance. This is captured by taking the variance of Output1. In general terms, this measures how spread out each value from the output variable is from its average value. Using regression, the idea is to explain how much variance can be accounted for by the predictor.
This leads to the second calculation: unexplained variance. For a simple regression model, this is the amount of variance that the predictor variable does not explain. To derive this value, I squared the model's sigma, which is the residual standard error; squaring it yields the residual, or unexplained, variance. This value can then be compared against the total variance to assess model fit. If the model explains the output's variance well, the unexplained variance should be low. Overall, this matters because it situates how well the model explains the output variable, which is ultimately one of the goals of regression.
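This decomposition can be sketched on synthetic data (not the project set), verifying that the manual calculation, (total variance - sigma^2) / total variance, closely matches summary()'s R-squared. The match is approximate only because sigma^2 divides by n - 2 while var() divides by n - 1:

```r
# Minimal sketch on synthetic data: manual variance-explained vs summary R2
set.seed(1)
x <- rnorm(500)
y <- 2 * x + rnorm(500, sd = 0.5)

fit <- summary(lm(y ~ x))
total_var <- var(y)             # total variance (denominator n - 1)
unexplained <- fit$sigma ^ 2    # residual variance (denominator n - 2)
manual_r2 <- (total_var - unexplained) / total_var

c(manual = manual_r2, summary = fit$r.squared)  # nearly identical
```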
To automate this process across all seven models, I utilized sapply to run the models and collect the unexplained variance in one pass. I captured the results in a data frame so they could be presented nicely, while also adding a new variable: the total percent of variance explained by each model. This is a manual R2, which can be checked for accuracy against the normal model summary to make sure the calculations are correct.
The output of the custom function can be seen below. It shows that all of the models explain a very high share of the output's variance. The model I started with, the 3-month bond, is on the lower side, although that's extremely relative here given an R2 of about .96. The lowest is the 30-year bond with an R2 of about .935, while the 3-year bond is highest with a remarkable .998. This table really highlights how well each predictor explains the output variable. It lends further credence to the output being some amalgamation of the inputs, given that some of the models explain nearly all the variance in Output1.
variance_explained <- function(columns){
var_df <- data.frame(
Model = colnames(treasury_data[,columns]),
Total.Variance = var(treasury_data$Output1),
Unexplained.Variance = sapply(columns, function(x) summary(lm(Output1 ~ treasury_data[,x], data = treasury_data))$sigma ^ 2))
var_df %>%
mutate(Percent.Explained.R2 = (Total.Variance - Unexplained.Variance) / Total.Variance * 100) %>% custom_kable()
}
variance_explained(2:8)
| Model | Total.Variance | Unexplained.Variance | Percent.Explained.R2 |
|---|---|---|---|
| USGG3M | 76.80444 | 2.8570577 | 96.28009 |
| USGG6M | 76.80444 | 1.9673208 | 97.43853 |
| USGG2YR | 76.80444 | 0.2588092 | 99.66303 |
| USGG3YR | 76.80444 | 0.1596570 | 99.79213 |
| USGG5YR | 76.80444 | 0.6382572 | 99.16898 |
| USGG10YR | 76.80444 | 2.3667523 | 96.91847 |
| USGG30YR | 76.80444 | 4.9672862 | 93.53255 |
Much like the previous custom function deriving variance explanations for each model, I put together one to review model significance (albeit with apply). Through this, I'm checking whether each overall model slope is significantly different from zero (the null hypothesis). Not surprisingly, considering the previous work, all the models are highly significant. In fact, their respective p-values are so small that they register as zero. Alongside the model p-values, I've included the F statistic and the degrees of freedom for model and residuals. These are the components used to derive the p-value in the F test, which I've reproduced in the function using pf.
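The pf mechanics can be sketched in isolation on synthetic data (not the project set), recovering a model p-value from the three components of summary()'s fstatistic:

```r
# Sketch: deriving a model p-value from the F statistic via pf()
set.seed(42)
x <- rnorm(200)
y <- 1.5 * x + rnorm(200)

fstat <- summary(lm(y ~ x))$fstatistic   # named vector: value, numdf, dendf
p_val <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
            lower.tail = FALSE)          # upper tail of the F distribution
p_val                                    # vanishingly small for a strong fit
```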
model_significance <- function(columns){
sig_df <- data.frame(
model = colnames(treasury_data[,columns]),
t(apply(treasury_data[,columns], 2, function(x) summary(lm(Output1 ~ x, treasury_data))$fstatistic)))
sig_df %>%
rename(f.statistic = value) %>%
mutate(model.p.value = pf(f.statistic, numdf, dendf, lower.tail = F)) %>%
custom_kable()
}
model_significance(2:8)
| model | f.statistic | numdf | dendf | model.p.value |
|---|---|---|---|---|
| USGG3M | 214798.7 | 1 | 8298 | 0 |
| USGG6M | 315695.9 | 1 | 8298 | 0 |
| USGG2YR | 2454520.2 | 1 | 8298 | 0 |
| USGG3YR | 3984011.3 | 1 | 8298 | 0 |
| USGG5YR | 990359.0 | 1 | 8298 | 0 |
| USGG10YR | 261016.2 | 1 | 8298 | 0 |
| USGG30YR | 120021.6 | 1 | 8298 | 0 |
For my previous functions I've used both sapply and apply, and that continues here for collecting all the model slopes and intercepts. Per the question, I've utilized apply as part of a larger function to collect and display the model coefficients.
Having already covered the slope and intercept for the 3-month bond, the interpretations here are all similar, save for the values themselves. The models have similar slopes, ranging from about 2.4 (USGG2YR) to about 3.1 (USGG30YR). This means that for every one unit increase in the given yield, the expected output would increase by that value. The intercepts vary more widely, from -11.72318 (USGG3M) to -21.08590 (USGG30YR). In combination with the slopes, this means each line sits at a slightly different angle, despite the models being so similar.
Again, all the slopes and intercepts are very significant, as seen by the zeroes in both p-value columns. These p-values are derived from t-tests, hence the inclusion of the respective t-values (all of which are, naturally, very high).
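As a small illustration of how those zeroes arise, a coefficient p-value is the two-sided tail probability of its t-value under the t distribution with the model's residual degrees of freedom. Plugging in the slope t-value for USGG3M from the summary earlier:

```r
# Sketch: coefficient p-value from its t-value (two-sided t test)
t_value <- 463.4638   # slope t-value for USGG3M from the model summary
df <- 8298            # residual degrees of freedom

p_val <- 2 * pt(abs(t_value), df, lower.tail = FALSE)
p_val   # 0 -- the tail probability underflows double precision
```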
input_coefficients <- function(columns){
input_df <- data.frame(
Model = colnames(treasury_data[,columns]),
t(apply(treasury_data[,columns], 2, function(x) summary(lm(Output1 ~ x, treasury_data))$coefficients[c(1, 2, 5, 6, 7, 8)])))
rownames(input_df) <- NULL
input_df %>%
rename(Intercept = X1,
Slope = X2,
Intercept.t.value = X3,
Slope.t.value = X4,
Intercept.p.value = X5,
Slope.p.value = X6) %>% custom_kable()
}
(input_coef <- input_coefficients(2:8))
| Model | Intercept | Slope | Intercept.t.value | Slope.t.value | Intercept.p.value | Slope.p.value |
|---|---|---|---|---|---|---|
| USGG3M | -11.72318 | 2.507561 | -373.7126 | 463.4638 | 0 | 0 |
| USGG6M | -12.09753 | 2.497235 | -457.0457 | 561.8683 | 0 | 0 |
| USGG2YR | -13.05577 | 2.400449 | -1301.5071 | 1566.6908 | 0 | 0 |
| USGG3YR | -13.86162 | 2.455793 | -1687.6241 | 1995.9988 | 0 | 0 |
| USGG5YR | -15.43665 | 2.568742 | -866.3143 | 995.1678 | 0 | 0 |
| USGG10YR | -18.06337 | 2.786991 | -461.0150 | 510.8975 | 0 | 0 |
| USGG30YR | -21.08590 | 3.069561 | -321.4475 | 346.4413 | 0 | 0 |
The last measure to review from each model is adjusted R2. While R2 was already discussed, it's worth spending time on the adjusted version as well. Adjusted R2 still accounts for how much variance a model explains but with a slight formula difference. Inherently, adding new predictors to a model explains more variance, even if they add no worthwhile information. To offset inflating R2 with new but ultimately less useful predictors, adjusted R2 applies a "penalty" and can be lowered as a result. Ideally, the model will be nearly equal on both metrics; if adjusted R2 is notably lower, it's a sign that redundant or less than ideal predictors are included. In all the models here, the adjusted R2 matches the original variance explained numbers. Even though these are one predictor models, seeing them aligned signals there isn't any excess or redundant information.
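The penalty can be made concrete with the standard formula, adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). With n = 8300 observations and a single predictor, the correction is tiny, which is why the two metrics agree here (a sketch using the USGG3M R2 from the table above):

```r
# Sketch: the adjusted R2 penalty for one predictor on 8300 observations
r2 <- 0.9628009   # USGG3M R2 from the variance table
n <- 8300         # observations
p <- 1            # predictors

adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj   # ~0.9628 -- indistinguishable from R2 at four decimal places
```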
adj_rsquared <- data.frame(
lapply(treasury_data[,2:8],
function(x) summary(lm(Output1 ~ x, data = treasury_data))[9])
)
names(adj_rsquared) <- names(treasury_data[,2:8])
rownames(adj_rsquared) <- "adj.r.squared"
custom_kable(t(adj_rsquared))
| adj.r.squared | |
|---|---|
| USGG3M | 0.9628009 |
| USGG6M | 0.9743853 |
| USGG2YR | 0.9966303 |
| USGG3YR | 0.9979213 |
| USGG5YR | 0.9916898 |
| USGG10YR | 0.9691847 |
| USGG30YR | 0.9353255 |
Before plotting the fitted values, it's worth doing a final review of each linear regression model, this time from a visual standpoint. Previously, I plotted only USGG3M, so it's worthwhile to revisit all the bivariate relationships. As seen, they all have the same uneven distributions. This lends visual confirmation to the .998 R2 for USGG3YR as well: the points are almost wholly covered by the regression line.
bond_trend %>%
filter(treasury.yield != "Output1") %>%
mutate(Output1 = rep(treasury_data$Output1, 7)) %>%
ggplot(aes(interest.rate, Output1)) +
geom_jitter(aes(colour = treasury.yield), alpha = .2) +
geom_smooth(method = "lm", size = 1.3) +
facet_wrap(facets = "treasury.yield", nrow = 2) +
theme(legend.position = "none") +
labs(title = "Output1 vs all bond category interest rates- all variables have a very significant relationship",
subtitle = "Scatterplot patterns all display clustered points which are unevenly distributed",
caption = "Source: Course Project Treasury Data")
These scatterplots also foreshadow the fitted values versus the actuals quite well. As before, I've collected all the fitted values, this time using lapply. They further confirm the very high model fits. Again, USGG3YR displays a nearly perfect fit, though the plots for USGG2YR and USGG5YR show very close fits as well.
These are in line with my expectations following the linear model summary reviews; the highest R2 models have the best fitted values. One slight surprise is that USGG30YR has slightly more erratic fitted values, specifically in the 2000s, compared to the best performing models. Similarly, USGG3M, USGG6M, and USGG10YR have periods where they differ from the actual values. Even in these cases, however, the fitted values remain in close alignment with the original inputs.
fitted_values <- data.frame(
lapply(treasury_data[,2:8],
function(x) lm(x ~ Output1, data = treasury_data)$fitted.values)
)
fitted_values <- melt(fitted_values,
value.name = "fitted.values",
variable.name = "treasury.yield")
bond_trend <- bond_trend %>%
filter(treasury.yield != "Output1")
bond_trend <- cbind(bond_trend, fitted_values$fitted.values)
bond_trend %>%
rename(fitted.values = `fitted_values$fitted.values`) %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = interest.rate), colour = "royalblue3", alpha = .75) +
geom_line(aes(y = fitted.values), size = 1.3,
colour = "orange2", alpha = .25) +
facet_wrap(facets = "treasury.yield", nrow = 2) +
labs(title = "Input values plotted against fitted regression values for all bond categories",
subtitle = "All models have very close alignment between actual (blue line) and fitted (orange line)- 3YR is closest (Adj. R2 = 0.998)",
caption = "Source: Course Project Treasury Data")
Another crucial part of regression is checking the residuals against the fitted values. While the previous plot shows the very close fit, it's also instructive to check how the residuals are distributed across the fitted values using a scatterplot. For convenience, I've made a data frame to check all the bond categories simultaneously.
In an ideal situation, the points would be randomly scattered around zero. However, these plots do not show that. In fact, they all exhibit systematic error, which is not ideal. This again points to some underlying systematic structure in the variables; I think it adds further strength to the idea that the output is an amalgam of the inputs in some way.
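To illustrate what "systematic error" looks like in isolation, here is a sketch on synthetic data: fitting a straight line to a relationship with a quadratic component leaves the curvature behind in the residuals, rather than random scatter around zero.

```r
# Sketch: a misspecified linear fit leaves structure in the residuals
set.seed(7)
x <- seq(-3, 3, length.out = 300)
y <- x + x ^ 2 + rnorm(300, sd = 0.3)   # true relationship is curved

fit <- lm(y ~ x)                        # deliberately fits only a line
plot(fit$fitted.values, fit$residuals)  # U-shaped, not random, pattern
cor(fit$residuals, x ^ 2)               # strongly positive: structure remains
```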
bond_categories <- colnames(treasury_data[,2:8])
residual_check <- data.frame(
model = factor(rep(bond_categories, each = 8300), levels = bond_categories),
resid = c(linear3M$residuals, linear6M$residuals, linear2YR$residuals,
linear3YR$residuals, linear5YR$residuals, linear10YR$residuals,
linear30YR$residuals),
fitted = c(linear3M$fitted.values, linear6M$fitted.values, linear2YR$fitted.values,
linear3YR$fitted.values, linear5YR$fitted.values, linear10YR$fitted.values,
linear30YR$fitted.values))
residual_check %>%
ggplot(aes(fitted, resid, colour = model)) +
geom_jitter(alpha = .2, show.legend = F) +
geom_hline(yintercept = 0, colour = "dodgerblue2", size = 1.3) +
facet_wrap(facets = "model", nrow = 2, scales = "free_y") +
labs(title = "Residuals vs fitted values for simple linear models from all bond categories",
subtitle = "Points should be randomly distributed around zero; however, all plots seem to have systematic patterns signalling model issues",
caption = "Source: Course Project Treasury Data",
y = "model residuals",
x = "fitted values")
Moving into the next section, I'm developing linear models again, but with the output as the predictor this time.
linear3M_output <- lm(treasury_data[,2] ~ Output1, data = treasury_data)
linear6M_output <- lm(treasury_data[,3] ~ Output1, data = treasury_data)
linear2YR_output <- lm(treasury_data[,4] ~ Output1, data = treasury_data)
linear3YR_output <- lm(treasury_data[,5] ~ Output1, data = treasury_data)
linear5YR_output <- lm(treasury_data[,6] ~ Output1, data = treasury_data)
linear10YR_output <- lm(treasury_data[,7] ~ Output1, data = treasury_data)
linear30YR_output <- lm(treasury_data[,8] ~ Output1, data = treasury_data)
summary(linear3YR_output)
##
## Call:
## lm(formula = treasury_data[, 5] ~ Output1, data = treasury_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23099 -0.10862 -0.01352 0.10095 0.83112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6444580 0.0017841 3164 <2e-16 ***
## Output1 0.4063541 0.0002036 1996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1625 on 8298 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 3.984e+06 on 1 and 8298 DF, p-value: < 2.2e-16
As per the instructions, I’ve collected all the slopes and intercepts with the same basic apply function used for the initial linear models. While the function is slightly redundant here, I wanted to get some development practice, so it’s included.
The intercepts follow the same basic pattern as the initial linear models, rising slowly from USGG3M to USGG30YR, though the values span a narrower range. The slopes follow a similar construction. Perhaps more interesting is that the slope t-values are identical across the two model directions. The formula for a slope's t-value is the slope coefficient divided by its standard error, and despite both quantities differing between the models, the resulting t-value is the same. How does this work?
It's actually an instructive point to review. Despite the differences in slopes and standard errors, the test statistic depends only on the correlation between the two variables and the sample size, and correlation is symmetric, so switching input and output doesn't change it. In the same vein, the model R2 does not change, since the proportion of variance explained is the same in either direction. While a slight digression, it's worth delving into the underlying statistics to see why these values end up identical.
output_coefficients <- function(columns){
input_df <- data.frame(
Model = colnames(treasury_data[,columns]),
t(apply(treasury_data[,columns], 2, function(x) summary(lm(x ~ Output1, treasury_data))$coefficients[c(1, 2, 5, 6, 7, 8)])))
rownames(input_df) <- NULL
input_df %>%
rename(Intercept = X1,
Slope = X2,
Intercept.t.value = X3,
Slope.t.value = X4,
Intercept.p.value = X5,
Slope.p.value = X6) %>% custom_kable()
}
(output_coef <- output_coefficients(2:8))
| Model | Intercept | Slope | Intercept.t.value | Slope.t.value | Intercept.p.value | Slope.p.value |
|---|---|---|---|---|---|---|
| USGG3M | 4.675134 | 0.3839609 | 643.9555 | 463.4638 | 0 | 0 |
| USGG6M | 4.844369 | 0.3901870 | 796.0347 | 561.8683 | 0 | 0 |
| USGG2YR | 5.438888 | 0.4151851 | 2341.9882 | 1566.6908 | 0 | 0 |
| USGG3YR | 5.644458 | 0.4063541 | 3163.8133 | 1995.9988 | 0 | 0 |
| USGG5YR | 6.009421 | 0.3860610 | 1767.6898 | 995.1678 | 0 | 0 |
| USGG10YR | 6.481316 | 0.3477544 | 1086.5690 | 510.8975 | 0 | 0 |
| USGG30YR | 6.869355 | 0.3047124 | 891.2273 | 346.4413 | 0 | 0 |
On to the logistic component, focusing on Federal Reserve easing and tightening. This section starts with some data preparation. The initial set only has a one during the periods when easing or tightening takes place; without any other markers, there isn't much to work with. To this end, this first chunk starts the process of transforming and combining these columns by recoding any period of easing as zero. Since the Fed cannot be simultaneously easing and tightening, this is a binary marker, and the columns are now prepped for combination, with tightening equal to one and easing equal to zero.
federal_cycles <- treasury_data %>%
mutate(Easing = ifelse(Easing == 1, 0, NA),
Tightening = ifelse(Tightening == 1, 1, NA))
Staying with the flow of the assignment, I checked against the original copy to make sure my transformed values match. In this case, they did.
custom_kable(federal_cycles[c(550:560, 900:910, 970:980),c(1, 10, 11)])
| Date | Easing | Tightening | |
|---|---|---|---|
| 550 | 1983-03-29 | NA | NA |
| 551 | 1983-03-30 | NA | NA |
| 552 | 1983-03-31 | NA | NA |
| 553 | 1983-04-04 | NA | 1 |
| 554 | 1983-04-05 | NA | 1 |
| 555 | 1983-04-06 | NA | 1 |
| 556 | 1983-04-07 | NA | 1 |
| 557 | 1983-04-08 | NA | 1 |
| 558 | 1983-04-11 | NA | 1 |
| 559 | 1983-04-12 | NA | 1 |
| 560 | 1983-04-13 | NA | 1 |
| 900 | 1984-08-27 | NA | 1 |
| 901 | 1984-08-28 | NA | 1 |
| 902 | 1984-08-29 | NA | 1 |
| 903 | 1984-08-30 | NA | 1 |
| 904 | 1984-08-31 | NA | 1 |
| 905 | 1984-09-04 | NA | NA |
| 906 | 1984-09-05 | NA | NA |
| 907 | 1984-09-06 | NA | NA |
| 908 | 1984-09-07 | NA | NA |
| 909 | 1984-09-10 | NA | NA |
| 910 | 1984-09-11 | NA | NA |
| 970 | 1984-12-10 | 0 | NA |
| 971 | 1984-12-11 | 0 | NA |
| 972 | 1984-12-12 | 0 | NA |
| 973 | 1984-12-13 | 0 | NA |
| 974 | 1984-12-14 | 0 | NA |
| 975 | 1984-12-17 | 0 | NA |
| 976 | 1984-12-18 | 0 | NA |
| 977 | 1984-12-19 | 0 | NA |
| 978 | 1984-12-20 | 0 | NA |
| 979 | 1984-12-21 | 0 | NA |
| 980 | 1984-12-24 | 0 | NA |
As alluded to, the end goal here was to combine the federal monetary policy markers into one column, now called Tightening. I’ve also created a stand-alone data frame for this component to avoid transforming the original data too much. In the same vein, about 70% of the rows are NA for both markers, so making a reduced set makes sense.
federal_cycles <- federal_cycles %>%
filter(!is.na(Easing) | !is.na(Tightening)) %>%
mutate(Tightening = ifelse(is.na(Tightening), 0, Tightening)) %>%
select(-Easing)
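The recode-and-combine logic can be sketched on a toy example (hypothetical vectors, not the project data), using base ifelse calls so each step is explicit:

```r
# Toy vectors standing in for the original Easing/Tightening markers
toy_easing     <- c(1, NA, NA)   # day 1 is an easing day
toy_tightening <- c(NA, 1, NA)   # day 2 is a tightening day; day 3 is neither

# Recode: easing becomes 0, tightening stays 1, everything else NA
easing_recoded  <- ifelse(toy_easing == 1, 0, NA)
tighten_recoded <- ifelse(toy_tightening == 1, 1, NA)

# Combine into one binary marker and drop days in neither cycle
combined <- ifelse(is.na(tighten_recoded), easing_recoded, tighten_recoded)
combined[!is.na(combined)]  # 0 1
```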
I thought doing a quick profile of the new set was reasonable since so many values were dropped. The table shows that federal reserve easing or tightening took place between fall 1981 and fall 1992, a period spanning slightly over a decade. To put this in perspective, the entire data set spans from January 1981 to June 2014, so a considerable portion has been removed for this component of the analysis.
federal_cycles %>%
summarise(min.date.range = min(Date),
max.date.range = max(Date),
date.range = difftime(max.date.range, min.date.range, units = "days"),
years.total = round(as.numeric(date.range) / 365)) %>%
custom_kable()
| min.date.range | max.date.range | date.range | years.total |
|---|---|---|---|
| 1981-11-02 | 1992-09-04 | 3959 days | 11 |
Additionally, the variable is fairly unbalanced. There are far more easing than tightening samples (1585 vs 773).
federal_cycles %>%
mutate(Tightening = ifelse(Tightening == 0, "Easing", "Tightening")) %>%
count(Tightening, name = "period_count") %>%
mutate(variable_percent = period_count / sum(period_count) * 100) %>%
custom_kable()
| Tightening | period_count | variable_percent |
|---|---|---|
| Easing | 1585 | 67.21798 |
| Tightening | 773 | 32.78202 |
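The percentages in the table are straightforward to verify from the raw counts:

```r
# Share of easing vs tightening days, from the counts above
counts <- c(Easing = 1585, Tightening = 773)
round(counts / sum(counts) * 100, 2)  # 67.22 and 32.78
```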
Before jumping into the plots, some context is necessary. So far, I’ve eschewed reviewing the nuance of what easing and tightening entail from a monetary policy perspective. With this concept being essential to this section, I’ve included a brief overview here. Central banks around the world use monetary policy to regulate specific factors within the economy. In the US, the Federal Reserve’s leading tool is the federal funds rate, and rate setting sits at the centre of its monetary policy tool kit. Tightening occurs when the Fed raises the federal funds rate, and easing occurs when it lowers the rate (Investopedia, 2018). With that in mind, the plot should highlight rates going up and down depending on the policy stance.
Per the definition, these periods closely correspond to rate changes depending on the cycle. Points when the orange line is at 20 correspond to tightening periods; when the line drops to zero, there’s an easing period. As is evident, all the bond rates and the output closely mirror these policy changes, albeit with different levels.
This plot only offers a one-dimensional, inward look at these cycles. There aren’t really any wider macro insights to be gained here about the US economy or market more generally since the predictors only focus on the fed. That said, these periods can easily be picked up so this would be an interesting follow up analysis.
As an aesthetic note, I thought the original plot was too busy with colour so I made all the yields gray while highlighting the output and federal reserve monetary cycles. I think this makes added sense given my intuition that the output seems to be a combination of the input predictors, something which this plot reasonably strengthens given how all the variables move in such close concert during the easing and tightening.
fed_trend <- federal_cycles %>%
select(-Date) %>%
mutate(Tightening = Tightening * 20,
index = 1:nrow(federal_cycles)) %>%
melt(value.name = "interest.rate",
variable.name = "treasury.yield",
id.vars = "index")
fed_trend %>%
ggplot(aes(index, interest.rate, colour = treasury.yield)) +
geom_line(aes(size = treasury.yield)) +
scale_y_continuous(breaks = seq(-20, 30, 5)) +
scale_colour_manual(values = c("lightgray", "lightgray", "lightgray", "lightgray",
"lightgray", "lightgray", "lightgray", "royalblue2",
"darkorange"),
name = "Treasury.Legend",
breaks = c("USGG3M", "Output1", "Tightening"),
labels = c("Bonds", "Output1", "Monetary.Tightening")) +
scale_size_manual(values = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.75, 1.5),
guide = "none") +
guides(colour = guide_legend(override.aes = list(size = 2))) +
labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
subtitle = "Rates and output seem to have positive association with federal activities (i.e. rise in tightening, fall during easing)",
y = "Data and Binary Federal Tightening",
x = "Samples",
caption = "Source: Course Project Treasury Data")
This simple logistic model is useful for understanding whether the federal reserve is either easing or tightening based on the USGG3M interest rate levels. The model summary offers insights towards this end.
Overall, the slope and intercept are very significant as indicated by their respective p-values. This isn’t all that surprising since the two variables are inherently linked from a practical perspective: the Federal Reserve lowers or raises rates during a period of easing or tightening.
Moving onto the slope coefficient, the summary indicates that for a one unit increase in USGG3M, the log odds of being in a tightening period increase by about 0.186; exponentiated, that is an odds ratio of about 1.20, i.e. roughly a 20% increase in the odds. This explanation is perfunctory however and requires some more pragmatic context. Harkening back to what the Federal Reserve is doing with tightening, these periods mark an increase in interest rates. This policy direction is therefore, by definition, noted for rising rates. As such, the model picks up this intuitive relationship between a tightening period and interest rate increases. The model verifies a real-world association and signals that as rates rise, the likelihood of being in a tightening period, denoted with log odds, increases.
On top of this, the samples included for the model come only from 1981 to 1992, when rates tended to be much higher than in the later 1990s and 2000s. As such, the relationship is likely accentuated more than it would be if the entire sample were included.
logistic_3M <- glm(Tightening ~ USGG3M,
data = federal_cycles,
family = binomial(link = logit))
summary(logistic_3M)
##
## Call:
## glm(formula = Tightening ~ USGG3M, family = binomial(link = logit),
## data = federal_cycles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4239 -0.9014 -0.7737 1.3548 1.6743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.15256 0.17328 -12.422 <2e-16 ***
## USGG3M 0.18638 0.02144 8.694 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2904.8 on 2356 degrees of freedom
## AIC: 2908.8
##
## Number of Fisher Scoring iterations: 4
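To make the slope interpretation concrete, here is a small sketch that converts the fitted coefficients (copied from the summary above) into an odds ratio and into predicted probabilities at a few rate levels; the numbers are hard-coded so the snippet stands alone:

```r
# Coefficients from the simple logistic model summary above
b0 <- -2.15256   # intercept
b1 <-  0.18638   # USGG3M slope (log-odds scale)

# Multiplicative change in the odds of tightening per one-point rate increase
exp(b1)  # about 1.20, i.e. roughly a 20% rise in the odds

# Predicted probability of a tightening period at a given USGG3M rate
p_tightening <- function(rate) plogis(b0 + b1 * rate)
round(p_tightening(c(5, 10, 15)), 3)
```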
As always, a visual representation of this association is instructive for better understanding its dynamic. While logistic models deal with binary outcome variables, bivariate plots with regression lines can still be constructed using ggplot2. The output is quite similar to a normal scatterplot with a line, but the y-axis here is either a one or a zero signifying the monetary period. Despite this, the logistic line (fit with a binomial family) still highlights the relationship even though the outcome variable is binary. This further confirms the logistic model summary and shows the positive association between tightening periods and increasing rates.
federal_cycles %>%
ggplot(aes(USGG3M, Tightening)) +
geom_point(alpha = .1, colour = "darkorange") +
geom_smooth(method = "glm", size = 1.5, se = FALSE,
method.args = list(family = "binomial")) +
labs(title = "Federal Reserve easing and tightening periods vs USGG3M yield rates with logistic line",
subtitle = "As yield rates rise, the likelihood of being in a tightening period (1) increases",
y = "Binary Federal Tightening",
caption = "Source: Course Project Treasury Data")
Continuing with visualizations which aid the model, I’ve plotted the logistic fitted values against the other bond rates, output, and monetary policy periods. The fitted values are probabilities between zero and one, so they have been scaled to interest rate units by multiplying by 20 (a purely visual rescaling, not mathematically derived). Both the fitted values and original interest rates are highlighted on the plot so as to make comparison easier. The model provides close estimates for the interest rates. Interestingly, the shapes of the lines are almost identical, but the fitted values are offset by several points below the actuals. As such, the model seems to provide an output with high, though consistent, bias alongside low variance (i.e. fitted values are consistently low but without much other variation from the original rates).
Despite the rescaling being done for visual appeal, it shows the fitted values comparing favourably to the originals. This is a good sign that the model picks up the association between interest rates and tightening. The clear uptick of fitted values during the known tightening periods, and drop during easing, further confirms that the model captures the association between rates and policy quite well.
fed_trend <- data.frame(
index = 1:nrow(federal_cycles),
treasury.yield = "Fitted.Values",
interest.rate = logistic_3M$fitted.values * 20) %>%
bind_rows(fed_trend)
fed_trend %>%
ggplot(aes(index, interest.rate, colour = treasury.yield)) +
geom_line(aes(size = treasury.yield)) +
scale_y_continuous(breaks = seq(-20, 30, 5)) +
scale_colour_manual(values = c("#B452CD", "lightgray", "darkorange", "lightgray",
"lightgray", "lightgray", "royalblue2", "lightgray",
"lightgray", "lightgray"),
name = "Treasury.Legend",
breaks = c("Fitted.Values", "USGG3M", "Output1", "Tightening"),
labels = c("Fitted.Values", "USGG3M",
"Bonds and Output1", "Monetary.Tightening")) +
scale_size_manual(values = c(1, 0.75, 1.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
guide = "none") +
guides(colour = guide_legend(override.aes = list(size = 2))) +
labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
subtitle = "Rates and output seem to rise in tightening and fall during easing- Logistic regression fitted values follow this trend",
y = "Data and Binary Federal Tightening",
x = "Samples",
caption = "Source: Course Project Treasury Data")
Building on the previous work, I developed a logistic regression model with all the bond categories.
logistic_full <- glm(Tightening ~ USGG3M + USGG6M + USGG2YR +
USGG3YR + USGG5YR + USGG10YR + USGG30YR,
data = federal_cycles,
family = binomial(link = logit))
In stark contrast to the simple logistic model, this summary is filled with contradictions and is difficult to interpret. All the model slopes are very significant, save for the 10 year bond class, which at a .32 p-value is not especially close. Further, the slope coefficients seem almost antithetical: four predictors suggest a decrease in log odds (including USGG3M, which was positive in the simple logistic model) while three indicate an increase. The magnitudes also don’t make sense. For example, the USGG3M coefficient of -3.3456 corresponds to an odds ratio of roughly 1/28, which is essentially cancelled out by the USGG30YR coefficient of 3.3254, an odds ratio of roughly 28. When viewed in totality, the exponentiated coefficients nearly cancel each other out given the balance of increases and decreases. All considered, these results are erratic, and model interpretation suffers as a result.
My guess for why this model seems erratic is that it has too many correlated predictors. All the bond categories are highly, if not almost perfectly, correlated, and this can obscure individual predictor contributions to the model and generally produce confusing coefficients. Collinearity, or multicollinearity in this case, among predictors inflates coefficient standard errors, which can push otherwise meaningful terms towards non-significance. This is a major issue because the model is being used to determine the effect of each predictor during a specific policy cycle.
summary(logistic_full)
##
## Call:
## glm(formula = Tightening ~ USGG3M + USGG6M + USGG2YR + USGG3YR +
## USGG5YR + USGG10YR + USGG30YR, family = binomial(link = logit),
## data = federal_cycles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2113 -0.8595 -0.5935 1.1306 2.5530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7552 0.4312 -11.029 < 2e-16 ***
## USGG3M -3.3456 0.2666 -12.548 < 2e-16 ***
## USGG6M 4.1559 0.3748 11.089 < 2e-16 ***
## USGG2YR 3.9460 0.7554 5.224 1.75e-07 ***
## USGG3YR -3.4642 0.9340 -3.709 0.000208 ***
## USGG5YR -3.2115 0.7795 -4.120 3.79e-05 ***
## USGG10YR -0.9705 0.9764 -0.994 0.320214
## USGG30YR 3.3254 0.6138 5.418 6.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2629.6 on 2350 degrees of freedom
## AIC: 2645.6
##
## Number of Fisher Scoring iterations: 4
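The sign-flipping behaviour is easy to reproduce on simulated data. This sketch (toy variables, not the treasury set) fits a logistic model with one predictor and then again with a nearly collinear copy added, showing how the standard errors inflate:

```r
set.seed(42)
x1 <- rnorm(500)
x2 <- x1 + rnorm(500, sd = 0.05)     # nearly a perfect copy of x1
y  <- rbinom(500, 1, plogis(x1))     # outcome driven by x1 alone

fit_single <- glm(y ~ x1,      family = binomial)
fit_joint  <- glm(y ~ x1 + x2, family = binomial)

# The lone slope is stable; with the collinear copy added, the two slopes
# become individually unstable and their standard errors balloon
summary(fit_single)$coefficients["x1", ]
summary(fit_joint)$coefficients[c("x1", "x2"), ]
```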
Building on the previous points, developing a model with only two terms, the 3 and 6 month categories, further highlights the effect of collinearity. Even with only one extra term, the USGG3M predictor flips sign relative to the single-predictor model and becomes negative. Again, I suspect this is driven by collinearity, as including one extra term drives a very large change in the predictor coefficients.
logistic_months <- glm(Tightening ~ USGG3M + USGG6M,
data = federal_cycles,
family = binomial(link = logit))
summary(logistic_months)
##
## Call:
## glm(formula = Tightening ~ USGG3M + USGG6M, family = binomial(link = logit),
## data = federal_cycles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1242 -0.8310 -0.6841 1.2849 1.9546
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4261 0.1802 -13.46 <2e-16 ***
## USGG3M -1.7899 0.1921 -9.32 <2e-16 ***
## USGG6M 1.9353 0.1873 10.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2783.2 on 2355 degrees of freedom
## AIC: 2789.2
##
## Number of Fisher Scoring iterations: 4
The initial model was easy to understand but did have a higher AIC. This metric is used in model comparison and a lower AIC is generally desirable. However, interpretation is crucial here, and despite its better AIC the full model is more difficult to work with, given one of the key goals is understanding the relationship between interest rates and Federal Reserve monetary policy cycles. It’s worth noting that AIC is derived from the model likelihood, so it applies to logistic models as well; it measures relative fit, not interpretability, so it shouldn’t be the sole criterion in this scenario.
AIC(logistic_3M, logistic_full) %>%
custom_kable()
| | df | AIC |
|---|---|---|
| logistic_3M | 2 | 2908.764 |
| logistic_full | 8 | 2645.579 |
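For a GLM, the reported AIC is just the residual deviance plus twice the number of estimated coefficients, which can be checked against the two summaries:

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_glm <- function(resid_deviance, n_coefs) resid_deviance + 2 * n_coefs

aic_glm(2904.8, 2)  # simple model (intercept + USGG3M): 2908.8
aic_glm(2629.6, 8)  # full model (intercept + 7 yields): 2645.6
```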
The fitted values plot for the full model is very noisy and doesn’t help illuminate which policy period the Fed is in. For example, in the first easing window, the values range from about eighteen to zero and fluctuate substantially. This window isn’t atypical either: periods of easing and tightening see erratic fitted value changes which are out of sync with the policy cycles. As mentioned earlier, the logistic model is being used primarily to drive a more thorough understanding of the easing and tightening periods using interest rates, and plotting the fitted values makes it possible to visualize rates and fitted values over time as well. For this reason, the simpler, more interpretable model seems like the better choice in this case.
fed_trend <- data.frame(
index = 1:nrow(federal_cycles),
treasury.yield = "Fitted.Values",
interest.rate = logistic_full$fitted.values * 20) %>%
bind_rows(fed_trend)
fed_trend %>%
ggplot(aes(index, interest.rate, colour = treasury.yield)) +
geom_line(aes(size = treasury.yield)) +
scale_y_continuous(breaks = seq(-20, 30, 5)) +
scale_colour_manual(values = c("royalblue2", "lightgray", "darkorange", "lightgray",
"lightgray", "lightgray", "lightgray", "lightgray",
"lightgray", "lightgray"),
name = "Treasury.Legend",
breaks = c("Fitted.Values", "Output1", "Tightening"),
labels = c("Fitted.Values", "Bonds & Output1",
"Monetary.Tightening")) +
scale_size_manual(values = c(0.1, 0.1, 1.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
guide = "none") +
guides(colour = guide_legend(override.aes = list(size = 2))) +
labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
subtitle = "Rates and output seem to rise in tightening and fall during easing- Full-model fitted values are noisy and track the cycles less cleanly",
y = "Data and Binary Federal Tightening",
x = "Samples",
caption = "Source: Course Project Treasury Data")
However, because I was curious about how each model would do in a prediction task, I split the logistic regression data frame and fit models for USGG3M and for all predictors on training data. Using these on a test set, it’s possible to gauge which model has better predictive utility. Neither is all that good, probably owing in part to the unbalanced sample, but the full model has a higher balanced accuracy (59% vs 46%). While this is certainly a digression, it provides some contextual support for how an analysis goal helps shape which model is preferable in a given situation. While the prediction task favoured the full model, its interpretation is still lacking for an inquiry based on better understanding relationships.
library(caret)
federal_cycles <- federal_cycles %>%
mutate(Tightening = as.factor(Tightening))
set.seed(1017)
data_split <- createDataPartition(y = federal_cycles$Tightening, p = .7, list = F)
fed_train <- federal_cycles %>%
slice(data_split)
fed_test <- federal_cycles %>%
slice(-data_split)
train_3M <- glm(Tightening ~ USGG3M,
data = fed_train,
family = binomial(link = logit))
train_full <- glm(Tightening ~ USGG3M + USGG6M + USGG2YR +
USGG3YR + USGG5YR + USGG10YR + USGG30YR,
data = fed_train,
family = binomial(link = logit))
model_predictions <- data.frame(
three_probs = predict(object = train_3M, newdata = fed_test, type = "response"),
full_probs = predict(object = train_full, newdata = fed_test, type = "response")
)
model_predictions <- model_predictions %>%
mutate(three_preds = factor(ifelse(three_probs > .5, "1", "0")),
full_preds = factor(ifelse(full_probs > .5, "1", "0")))
confusionMatrix(model_predictions$three_preds, fed_test$Tightening)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 437 231
## 1 38 0
##
## Accuracy : 0.619
## 95% CI : (0.582, 0.6549)
## No Information Rate : 0.6728
## P-Value [Acc > NIR] : 0.9989
##
## Kappa : -0.1019
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9200
## Specificity : 0.0000
## Pos Pred Value : 0.6542
## Neg Pred Value : 0.0000
## Prevalence : 0.6728
## Detection Rate : 0.6190
## Detection Prevalence : 0.9462
## Balanced Accuracy : 0.4600
##
## 'Positive' Class : 0
##
confusionMatrix(model_predictions$full_preds, fed_test$Tightening)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 437 169
## 1 38 62
##
## Accuracy : 0.7068
## 95% CI : (0.6717, 0.7402)
## No Information Rate : 0.6728
## P-Value [Acc > NIR] : 0.02886
##
## Kappa : 0.2205
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9200
## Specificity : 0.2684
## Pos Pred Value : 0.7211
## Neg Pred Value : 0.6200
## Prevalence : 0.6728
## Detection Rate : 0.6190
## Detection Prevalence : 0.8584
## Balanced Accuracy : 0.5942
##
## 'Positive' Class : 0
##
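The balanced accuracy figures cited above can be recomputed directly from the confusion-matrix counts. In caret’s output the “positive” class is 0 (easing), so sensitivity comes from the easing column; the helper below is a hypothetical convenience, with the numbers copied from the matrices above:

```r
# Balanced accuracy from a 2x2 confusion matrix laid out as in caret's print:
#   rows = prediction (0 then 1), columns = reference (0 then 1)
balanced_accuracy <- function(p0_r0, p0_r1, p1_r0, p1_r1) {
  sens <- p0_r0 / (p0_r0 + p1_r0)  # class 0 correctly predicted
  spec <- p1_r1 / (p0_r1 + p1_r1)  # class 1 correctly predicted
  (sens + spec) / 2
}

balanced_accuracy(437, 231, 38, 0)   # USGG3M model: 0.46
balanced_accuracy(437, 169, 38, 62)  # full model: about 0.594
```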
I’ve put together a faceted plot to review the logistic log odds and probabilities at once. While the log odds have a different scale, the line has the same basic shape as the two probability lines. The inverse-logit of the log odds, labeled here as probability, is the same as the fitted values, so both lines are identical. This is because the probability is derived from the model’s predicted values, gathered using predict, which by default returns the untransformed log odds; adding type = "response" to the predict call returns the converted probabilities instead. As such, showing both is somewhat redundant, but it does shed light on how R performs certain tasks, which is always useful.
logistic_odds <- data.frame(
odds = "log.odds",
values = predict(logistic_full))
logistic_odds <- data.frame(
odds = "probability",
values = 1 / (exp(-logistic_odds$values) + 1)) %>%
bind_rows(logistic_odds)
logistic_odds <- data.frame(
odds = "fitted.values",
values = logistic_full$fitted.values) %>%
bind_rows(logistic_odds)
logistic_odds %>%
mutate(Index = rep(1:nrow(federal_cycles), 3),
odds = factor(odds, levels = c("log.odds", "probability", "fitted.values"))) %>%
ggplot(aes(Index, values, colour = odds)) +
geom_line(size = 1.3) +
facet_wrap(facets = "odds", scales = "free_y", ncol = 1) +
scale_colour_manual(values = c("royalblue2", "darkorange", "#EE6AA7")) +
theme(legend.position = "none") +
labs(title = "Log odds and probabilities for logistic regression model with all predictors",
subtitle = "Both probability (orange line) and fitted values (pink line) are identical despite being derived differently (exp log odds vs glm fitted values)",
y = "Fitted Values & Log-Odds",
x = "Samples",
caption = "Source: Course Project Treasury Data")
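As a quick sanity check on the manual inverse-logit used in the chunk above, it matches R’s built-in plogis:

```r
# 1 / (exp(-x) + 1) is the logistic (inverse-logit) function, i.e. plogis(x)
log_odds <- c(-3, -1, 0, 1, 3)
manual   <- 1 / (exp(-log_odds) + 1)
all.equal(manual, plogis(log_odds))  # TRUE
```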
This section focuses on developing linear models with more than one predictor, including one with all bond categories. I’ll work to find the most suitable linear model by looking at different combinations of predictors as well.
As with the logistic regression, I’ve put together a data frame with only the variables needed for modelling.
bond_regression <- treasury_data %>%
select(-Date, -Easing, -Tightening)
To start, I’ve put together a full model with all the predictors. This call differs slightly from the logistic model with all predictors since I’ve opted to use Output1 ~ . to signify all predictors instead of typing their individual names. However, the output is still the same.
linear_full <- lm(Output1 ~ ., data = bond_regression)
I’ve collected the full model coefficients below. The intercept and slope estimates are probably the least interesting part of this table. It highlights that a full model would have an intercept around -14.9 and that each predictor contributes between roughly .3 and .42 to output changes. Taken together, the slopes sum to about 2.63, which is close to the unit change seen in one of the initial simple linear models. While this is a rough way of thinking about each contribution, given the predictors are so highly correlated I thought it was worth including.
What’s far more interesting here are the standard errors and t-values. Each of the inputs has a standard error of zero. A coefficient’s standard error reflects the uncertainty in its estimate; a smaller standard error points to a more precisely estimated effect. Here, there simply is no uncertainty at all. The resulting significance testing, which divides the estimates by these values, produces massive t-values and extremely significant tests. However, when a model seems too good to be true, there’s cause for thinking something is amiss.
summary(linear_full)$coefficients %>%
custom_kable()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -14.9041591 | 0 | -141024294891 | 0 |
| USGG3M | 0.3839609 | 0 | 3893968285 | 0 |
| USGG6M | 0.3901870 | 0 | 2601053702 | 0 |
| USGG2YR | 0.4151851 | 0 | 1615851177 | 0 |
| USGG3YR | 0.4063541 | 0 | 1231735395 | 0 |
| USGG5YR | 0.3860610 | 0 | 1474449865 | 0 |
| USGG10YR | 0.3477544 | 0 | 1241860763 | 0 |
| USGG30YR | 0.3047124 | 0 | 1945195584 | 0 |
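The rough “sum of contributions” arithmetic can be checked directly from the table’s slope estimates:

```r
# Slope estimates copied from the full-model coefficient table
slopes <- c(USGG3M = 0.3839609, USGG6M = 0.3901870, USGG2YR = 0.4151851,
            USGG3YR = 0.4063541, USGG5YR = 0.3860610, USGG10YR = 0.3477544,
            USGG30YR = 0.3047124)
round(sum(slopes), 3)  # about 2.634
```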
Building on this, the full model variance explanations (R2 and adjusted R2) confirm what I’ve alluded to throughout the analysis, namely that Output1 is some combination of the input variables. The final telltale sign here is perfect variance explanation: values of 1 for R2 and adjusted R2 indicate there is no variance left unexplained. While this might seem ideal, it has obvious negative implications for any modelling task. With no unexplained variance, there really isn’t a model in the traditional sense. Since the output is some combination of the inputs, there’s no error to model, and the fit simply describes the sample data; the predictors are the output, and the output is comprised of the predictors. With this in mind, the full regression model is not the best choice here given it just describes the data, negating the need for a model.
unlist(summary(linear_full)[8:9]) %>%
custom_kable()
| r.squared | 1 |
| adj.r.squared | 1 |
The degrees of freedom here come from the model summary: the first value (8) is the number of estimated coefficients (seven predictors plus the intercept), while the larger number, df2, is the residual degrees of freedom, total samples minus estimated coefficients (8300 - 8 = 8292, i.e. n - k - 1 with k = 7 predictors). These are used during calculations for R2 and adjusted R2.
unlist(summary(linear_full)[7]) %>%
custom_kable()
| df1 | 8 |
| df2 | 8292 |
| df3 | 8 |
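The degrees-of-freedom bookkeeping is simple arithmetic:

```r
n <- 8300  # daily samples in the regression set
k <- 7     # bond-yield predictors
p <- k + 1 # estimated coefficients, including the intercept (df1 in the table)
n - p      # residual degrees of freedom: 8292 (df2 in the table)
```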
As always, it’s good to check the full model residuals. The pattern here is unexpected for a model with an R2 of 1. I thought all the points would be perfectly aligned on the line straddling zero, but they form a distinct “H” shape. When looking at the scale, these are all essentially zeroes, which is in line with my expectations. However, points below -10 and above 10 on the fitted axis deviate into distinct clusters. This would seem to indicate that at higher and lower output ranges, the fitted values are very slightly less than perfect. Getting ahead a bit, since the output appears to be some combination of the inputs, it’s possible the combination is less exact at these more extreme values (although this is a guess).
autoplot(linear_full, which = 1) +
labs(title = "Fitted vs residuals for full linear model")
I’ve put together a null model here, which only includes an intercept. To construct this model in R, the lm call includes only the output variable and a 1 for input, which denotes the intercept.
linear_null <- lm(Output1 ~ 1, data = bond_regression)
The null model is used to show what the regression would look like without any predictors. It essentially acts as the null hypothesis, since the model cannot include any effects. An intercept-only model estimates just the sample mean of the output, which here is effectively zero, as the table shows. With no effect to test against the mean, the t-value is zero and the p-value is 1 (consistent with the null hypothesis).
summary(linear_null)$coefficients %>%
custom_kable()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0 | 0.0961954 | 0 | 1 |
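That the intercept-only estimate is just the sample mean is easy to demonstrate on a toy response (hypothetical values, not the project data):

```r
# An intercept-only lm estimates the sample mean of the response
y   <- c(-2, -1, 0, 1, 2)
fit <- lm(y ~ 1)
unname(coef(fit))  # 0, i.e. mean(y)
```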
The previous explanation further situates why there is no variance being explained. To have any value for R2 and adjusted R2, there have to be predictors included which might explain some of the output variance. Without any predictors, there is simply nothing to explain variance with, hence, by definition, the null model must have an R2 and adjusted R2 of zero.
unlist(summary(linear_null)[8:9]) %>%
custom_kable()
| adj.r.squared | 0 |
| r.squared | 0 |
The anova highlights that there is a significant difference between the null model and the full model, which suggests a relationship between the input variables and the output. In essence, the null hypothesis can be rejected in favour of the alternative that the difference is not equal to zero (i.e. the model has at least one non-zero slope). This isn’t surprising given all the previous work leading up to the test. The analysis of variance evaluates how much variance is attributable to the model versus the null, so the result is in line with the other findings.
anova(linear_full, linear_null)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR +
## USGG30YR
## Model 2: Output1 ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8292 0
## 2 8299 637400 -7 -637400 3.73e+22 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I’ve already indicated why I don’t think the full model is preferable here but, to reiterate, since it has a perfect fit it doesn’t do much beyond describing the data. Some variance needs to be left unexplained for a model to have wider utility; if the fit is perfect, there’s no need for a model at all.
What constitutes the best model here is tricky though. I’ve established why the full model isn’t useful from a statistical perspective but, I also think the analysis clearly points to the output being a combination of the inputs. With that in mind, models are essentially uncovering how much each variable explains about a combination of all the predictors. The utility of this might be to figure out which predictors by themselves, or in some non-full combination, explain the most for output1, thereby showing which is weighted most in the combination. In any case, there needs to be an R2 less than 1.
All considered, I think the best model is one that strikes a balance between high R2 and interpretability. In essence, I think the most preferable model is parsimonious, one which includes the fewest inputs with the most variance explained. Much like the logistic regression, these models are used for broad understanding so there should be a strong preference towards a model that can be explained and easily interpreted. Following these criteria, the model for USGG3YR is my selection for being the best, or more realistically, the most appropriate for the given situation. Viewing the anova test, the model is still very significant and, as the simple linear section showed, it outperforms other bond categories for variance explained.
anova(linear3YR, linear_null)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3YR
## Model 2: Output1 ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8298 1325
## 2 8299 637400 -1 -636075 3984011 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
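The anova table also lets the USGG3YR model’s R2 be recovered by hand, since the null model’s RSS equals the total sum of squares:

```r
# R^2 = 1 - RSS/TSS, using the residual sums of squares from the anova table
rss_3yr <- 1325     # USGG3YR model
tss     <- 637400   # null model (total sum of squares)
round(1 - rss_3yr / tss, 4)  # 0.9979
```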
Since I haven’t yet shown any combinations, I’ll use a comparison to highlight why I prefer the single-term model. The chunk below develops a model including USGG3M, USGG6M, and USGG30YR. As the anova shows, it’s the preferred model and adds more explanatory power compared to USGG3YR.
linear_mixed <- lm(Output1 ~ USGG3M + USGG6M + USGG30YR, data = bond_regression)
anova(linear_mixed, linear3YR)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3M + USGG6M + USGG30YR
## Model 2: Output1 ~ USGG3YR
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8296 762.53
## 2 8298 1324.83 -2 -562.31 3058.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, the model summary provides insight into why I prefer the USGG3YR model. The same multicollinearity seen in the logistic section plagues model interpretation here as well. USGG3M shows a negative relationship with the output while the other two remain positive. Again, this doesn’t conform to previous work and, by this point, my intuition flags it as suspect as well. Additionally, this model only adds a minor amount of R2 compared to the USGG3YR model (.9988 vs .9979).
This goes back to selecting a model based on the overall analysis goal. Since it’s fairly clear the output is a derivation of the inputs, the model shouldn’t, or plainly wouldn’t, be used for prediction. This leaves the model as a tool for understanding how specific inputs relate to the output, which favours simple, interpretable results. For this, I think the USGG3YR model is the best because it provides insight into the relationship between one bond category and the output while having the highest single-term R2 and clear coefficients. As a final point, realistically any of the single-term models does a reasonable job highlighting the input and output association, but USGG3YR also seems to explain the most variance, so it’s my choice.
summary(linear_mixed)
##
## Call:
## lm(formula = Output1 ~ USGG3M + USGG6M + USGG30YR, data = bond_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02535 -0.22005 -0.00836 0.20314 1.26245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.16756 0.01165 -1387.28 <2e-16 ***
## USGG3M -0.52808 0.01587 -33.28 <2e-16 ***
## USGG6M 2.13856 0.01656 129.14 <2e-16 ***
## USGG30YR 1.20484 0.00317 380.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3032 on 8296 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 2.309e+06 on 3 and 8296 DF, p-value: < 2.2e-16
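To put a number on the parsimony trade-off discussed above, here is a small simulated sketch (not the project data; x1 through x3 are hypothetical variables) comparing a one-term and a three-term model with AIC, which penalizes each added term:

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = .1)  #near-copies of x1, mimicking collinear yields
x3 <- x1 + rnorm(n, sd = .1)
y  <- 2 * x1 + rnorm(n, sd = .2)

simple <- lm(y ~ x1)
mixed  <- lm(y ~ x1 + x2 + x3)

#AIC trades fit against complexity; a marginally higher R2 can still lose
c(AIC_simple = AIC(simple), AIC_mixed = AIC(mixed))
c(R2_simple = summary(simple)$r.squared, R2_mixed = summary(mixed)$r.squared)
```

The same logic applies to the project models: the mixed model’s tiny R2 gain has to justify its extra, harder-to-interpret coefficients.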
Essentially, the rollapply function, with the width and by parameters set, takes the rolling mean of each bond category. Width tunes how many samples each window contains (20) while by sets how far each successive window shifts (in this case, five time points). The final dimensions of the rolling means object show how by works: it has 1657 rows, or about 8300 divided by 5.
window_width <- 20
window_shift <- 5
rolling_means <- rollapply(bond_regression,
width = window_width,
by = window_shift,
by.column = TRUE,
mean)
head(rolling_means) %>%
custom_kable()
| USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR | Output1 |
|---|---|---|---|---|---|---|---|
| 15.0405 | 14.0855 | 13.2795 | 12.9360 | 12.7825 | 12.5780 | 12.1515 | 20.14842 |
| 15.1865 | 14.1440 | 13.4855 | 13.1085 | 12.9310 | 12.7370 | 12.3370 | 20.55208 |
| 15.2480 | 14.2755 | 13.7395 | 13.3390 | 13.1500 | 12.9480 | 12.5500 | 21.04895 |
| 14.9345 | 14.0780 | 13.7750 | 13.4765 | 13.2385 | 13.0515 | 12.6610 | 21.02611 |
| 14.7545 | 14.0585 | 13.9625 | 13.6890 | 13.4600 | 13.2295 | 12.8335 | 21.31356 |
| 14.6025 | 14.0115 | 14.0380 | 13.7790 | 13.5705 | 13.3050 | 12.8890 | 21.39061 |
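The row count can be verified from the window arithmetic alone; a quick sketch with dummy data (assuming the same width and shift) shows rollapply produces floor((n - width) / by) + 1 windows:

```r
library(zoo)

n <- 8300; width <- 20; shift <- 5
dummy  <- matrix(rnorm(n * 2), ncol = 2)
rolled <- rollapply(dummy, width = width, by = shift, by.column = TRUE, mean)

nrow(rolled)               #1657 windows
(n - width) %/% shift + 1  #1657 from the arithmetic directly
```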
It’s fairly straightforward to recreate how this works on a small scale. Basically, the function takes a mean for rows 1 to 20, then shifts the window forward five rows and takes another for rows 6 to 25. This process is repeated across the entire data set for all columns (given by.column is set to TRUE).
roll_mean <- bond_regression %>%
slice(6:25) %>%
summarise(USGG3M_roll = mean(USGG3M),
USGG6M_roll = mean(USGG6M),
USGG2YR_roll = mean(USGG2YR),
USGG3YR_roll = mean(USGG3YR),
USGG5YR_roll = mean(USGG5YR),
USGG10YR_roll = mean(USGG10YR),
USGG30YR_roll = mean(USGG30YR))
bond_regression %>%
slice(1:20) %>%
summarise(USGG3M_roll = mean(USGG3M),
USGG6M_roll = mean(USGG6M),
USGG2YR_roll = mean(USGG2YR),
USGG3YR_roll = mean(USGG3YR),
USGG5YR_roll = mean(USGG5YR),
USGG10YR_roll = mean(USGG10YR),
USGG30YR_roll = mean(USGG30YR)) %>%
bind_rows(roll_mean) %>%
custom_kable()
| USGG3M_roll | USGG6M_roll | USGG2YR_roll | USGG3YR_roll | USGG5YR_roll | USGG10YR_roll | USGG30YR_roll |
|---|---|---|---|---|---|---|
| 15.0405 | 14.0855 | 13.2795 | 12.9360 | 12.7825 | 12.578 | 12.1515 |
| 15.1865 | 14.1440 | 13.4855 | 13.1085 | 12.9310 | 12.737 | 12.3370 |
The next chunk makes these windows explicit. The first row starts at 1 and, moving across, ends at 20. The next row is the 6 to 25 window, which was used to calculate the second mean above.
window_count <- seq(bond_regression$USGG3M)
rolling_window <- rollapply(window_count,
width = window_width,
by = window_shift,
by.column = FALSE,
FUN = function(z) z)
head(rolling_window, 10) %>%
custom_kable()
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
| 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
| 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
| 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
| 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 |
This object helps set up filtering so that, when the rolling values are plotted, they can be placed at the correct index alongside the corresponding USGG3M value.
calculation_points <- rolling_window[,10]
c(calculation_length = length(calculation_points)) %>%
custom_kable()
| calculation_length | 1657 |
My code diverges from the provided example here. The original code works towards setting up a matrix with USGG3M in one column and the rolling mean calculation in another, which is then visualized. I’ve done the same thing but used a data frame so I can utilize ggplot2 instead.
plot_means <- bond_regression %>%
select(USGG3M) %>%
mutate(mean_calculation = 0)
plot_means$mean_calculation[calculation_points] <- rolling_means[,1]
head(plot_means, 20) %>%
custom_kable()
| USGG3M | mean_calculation |
|---|---|
| 13.52 | 0.0000 |
| 13.58 | 0.0000 |
| 14.50 | 0.0000 |
| 14.76 | 0.0000 |
| 15.20 | 0.0000 |
| 15.22 | 0.0000 |
| 15.24 | 0.0000 |
| 15.08 | 0.0000 |
| 15.25 | 0.0000 |
| 15.15 | 15.0405 |
| 15.79 | 0.0000 |
| 15.32 | 0.0000 |
| 15.71 | 0.0000 |
| 15.72 | 0.0000 |
| 15.70 | 15.1865 |
| 15.35 | 0.0000 |
| 15.20 | 0.0000 |
| 15.06 | 0.0000 |
| 14.87 | 0.0000 |
| 14.59 | 15.2480 |
However, I still get the same plot, albeit formatted differently. As seen, the rolling mean points follow the general trend line of USGG3M.
plot_means %>%
mutate(index = seq(plot_means$mean_calculation),
mean_marker = ifelse(mean_calculation != 0, T, F)) %>%
ggplot(aes(x = index, colour = mean_marker)) +
geom_point(aes(y = mean_calculation, alpha = mean_marker), size = .75) +
geom_line(aes(y = USGG3M), colour = "darkgray", size = 2, alpha = .4) +
scale_alpha_discrete(range = c(0, 1), guide = FALSE) +
scale_colour_manual(values = c("white", "darkorange"),
name = "Rolling Calculation",
breaks = "TRUE",
labels = "mean") +
labs(title = "USGG3M bond yields (gray line) and rolling mean values (orange dots)",
y = "USGG3M (Value and Rolling Mean)",
x = "Index",
caption = "Source: Course Project Treasury Data")
Executing this requires making a matrix containing the day-over-day differences between observations before rollapply can compute the standard deviation. To do this, I’ve used apply with the diff function to take the difference of each column (MARGIN = 2 selects columns). Thereafter, I’ve added in the dates using rownames so the samples can be verified.
sd_diff <- apply(bond_regression, 2, diff)
rownames(sd_diff) <- treasury_data$Date[-1]
head(sd_diff) %>%
custom_kable()
| | USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR | Output1 |
|---|---|---|---|---|---|---|---|---|
| 4023 | 0.06 | 0.07 | 0.14 | 0.03 | -0.08 | -0.04 | 0.00 | 0.0758722 |
| 4024 | 0.92 | 0.74 | 0.50 | 0.47 | 0.40 | 0.27 | 0.22 | 1.3559162 |
| 4025 | 0.26 | 0.10 | 0.17 | 0.17 | 0.07 | -0.03 | 0.02 | 0.3011961 |
| 4026 | 0.44 | 0.30 | 0.44 | 0.33 | 0.20 | 0.22 | 0.22 | 0.8235321 |
| 4029 | 0.02 | -0.07 | -0.36 | -0.34 | -0.17 | -0.12 | -0.05 | -0.4298574 |
| 4030 | 0.02 | -0.13 | 0.13 | 0.03 | -0.03 | 0.08 | 0.00 | 0.0393581 |
From there, the same rollapply method used for means can be done for standard deviation. I’ve also added the date variable back into the bond regression set as it’s needed for visualizations.
Before moving on, I’ll address what the rolling standard deviation might be useful for. Standard deviation here essentially comprises a volatility rating. Since the rates are reasonably homogeneous in short date windows, any increased standard deviation marks rate fluctuations. With this in mind, it will be useful for picking out adverse financial events. A major event to watch for is the 2008 recession, which acts as an intuitive benchmark to gauge how well the rolling standard deviation captures volatility.
bond_regression <- treasury_data %>%
select(Date) %>%
bind_cols(bond_regression)
rolling_sd <- rollapply(sd_diff,
width = window_width,
by = window_shift,
by.column = TRUE,
sd)
head(rolling_sd) %>%
custom_kable()
I’m collecting the dates that correspond to the rolling standard deviation measurements so they can be merged with the calculated values and analyzed. In the previous chunk I reintroduced Date into the data frame and here, I’m using rollapply with a customized paste function to collect the samples. This yields a large character matrix that captures the rolling window dates.
rolling_dates <- rollapply(bond_regression[-1,1],
width = window_width,
by = window_shift,
by.column = FALSE,
function(x) paste0(x))
head(rolling_dates) %>%
custom_kable()
| 1981-01-06 | 1981-01-07 | 1981-01-08 | 1981-01-09 | 1981-01-12 | 1981-01-13 | 1981-01-14 | 1981-01-15 | 1981-01-16 | 1981-01-19 | 1981-01-20 | 1981-01-21 | 1981-01-22 | 1981-01-23 | 1981-01-26 | 1981-01-27 | 1981-01-28 | 1981-01-29 | 1981-01-30 | 1981-02-02 |
| 1981-01-13 | 1981-01-14 | 1981-01-15 | 1981-01-16 | 1981-01-19 | 1981-01-20 | 1981-01-21 | 1981-01-22 | 1981-01-23 | 1981-01-26 | 1981-01-27 | 1981-01-28 | 1981-01-29 | 1981-01-30 | 1981-02-02 | 1981-02-03 | 1981-02-04 | 1981-02-05 | 1981-02-06 | 1981-02-09 |
| 1981-01-20 | 1981-01-21 | 1981-01-22 | 1981-01-23 | 1981-01-26 | 1981-01-27 | 1981-01-28 | 1981-01-29 | 1981-01-30 | 1981-02-02 | 1981-02-03 | 1981-02-04 | 1981-02-05 | 1981-02-06 | 1981-02-09 | 1981-02-10 | 1981-02-11 | 1981-02-13 | 1981-02-17 | 1981-02-18 |
| 1981-01-27 | 1981-01-28 | 1981-01-29 | 1981-01-30 | 1981-02-02 | 1981-02-03 | 1981-02-04 | 1981-02-05 | 1981-02-06 | 1981-02-09 | 1981-02-10 | 1981-02-11 | 1981-02-13 | 1981-02-17 | 1981-02-18 | 1981-02-19 | 1981-02-20 | 1981-02-23 | 1981-02-24 | 1981-02-25 |
| 1981-02-03 | 1981-02-04 | 1981-02-05 | 1981-02-06 | 1981-02-09 | 1981-02-10 | 1981-02-11 | 1981-02-13 | 1981-02-17 | 1981-02-18 | 1981-02-19 | 1981-02-20 | 1981-02-23 | 1981-02-24 | 1981-02-25 | 1981-02-26 | 1981-02-27 | 1981-03-02 | 1981-03-03 | 1981-03-04 |
| 1981-02-10 | 1981-02-11 | 1981-02-13 | 1981-02-17 | 1981-02-18 | 1981-02-19 | 1981-02-20 | 1981-02-23 | 1981-02-24 | 1981-02-25 | 1981-02-26 | 1981-02-27 | 1981-03-02 | 1981-03-03 | 1981-03-04 | 1981-03-05 | 1981-03-06 | 1981-03-09 | 1981-03-10 | 1981-03-11 |
Finally, I’ve unified all the composite pieces in a single data frame with the required rolling standard deviation calculations for USGG3M, USGG5YR, USGG30YR, and Output1. As before, the sample dates come from the middle of each window (hence the 10th column).
rownames(rolling_sd) <- rolling_dates[,10]
plot_sd <- as.data.frame(rolling_sd) %>%
add_rownames(var = "Date") %>%
select(Date, USGG3M, USGG5YR, USGG30YR, Output1)
head(plot_sd) %>%
custom_kable()
| Date | USGG3M | USGG5YR | USGG30YR | Output1 |
|---|---|---|---|---|
| 1981-01-19 | 0.3433258 | 0.1713192 | 0.1202147 | 0.5639875 |
| 1981-01-26 | 0.2933383 | 0.1450082 | 0.1192201 | 0.4707427 |
| 1981-02-02 | 0.2613180 | 0.1654110 | 0.1351909 | 0.4681168 |
| 1981-02-09 | 0.2551754 | 0.1717219 | 0.1422183 | 0.4786189 |
| 1981-02-18 | 0.2480551 | 0.1744767 | 0.1516540 | 0.4888569 |
| 1981-02-25 | 0.1963884 | 0.1822917 | 0.1537351 | 0.4788897 |
The plot below displays the pairwise coefficients from the previous linear regressions. The coefficients for each bond category represent the per-unit association with Output1 during the high volatility days. The model intercept, where Output1 falls when all predictors are zero, can be compared to each category coefficient.
Across the top row, there is a correlation coefficient for each bond category in relation to the intercept. Not surprisingly, two are negative associations and one is positive. I expected a mix of signs owing to collinearity, which has been reviewed in previous sections and once again appears here. The negative correlation for USGG3M is just in the weak category while USGG30YR crosses into medium strength. USGG5YR is weakly positive, but the interpretation suffers due to collinearity. Overall, there seems to be some evidence of an association between high volatility days and downward pressure on rates. The above correlations are also captured in the first column, where each slope coefficient is plotted against the corresponding intercept value for high volatility days. I’ve included regression lines to better situate the associations.
The other scatterplots show the relationships between the various bond category slope coefficients. The strongest relationship is between USGG5YR and USGG30YR, which have a strong, negative correlation (-.833). This captures how both slope coefficients move strongly down during high volatility periods. The remaining two pairwise plots are less informative given they show very weak correlations. Taking the macro view though, I think intuitively these slope coefficients should move downwards during high volatility, which would be more evident if each model was fit individually to avoid collinearity. That said, outside of intuition, there is still evidence here of a negative association between rates and high volatility periods.
coefficients <- as.data.frame(coefficients)
regression_pairs <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method = lm, color = "blue", ...)
p
}
coefficients %>%
ggpairs(title = "Pairwise plot for USGG3M, USGG5YR, and USGG30YR",
lower = list(continuous = regression_pairs))
To assess this question, it’s important to keep the correlations in mind from the previous plot. I’ve developed a specific correlation plot here to re-emphasize how the respective bond category coefficients are correlated. At a high level, the takeaway here is USGG3M has weak correlation with both the other categories while USGG5YR and USGG30YR have a strong negative correlation.
corplot_df <- cor(coefficients[,-1])
corrplot.mixed(corplot_df,
title = "High volatility linear model correlations",
mar = c(0, 0, 1, 0))
Continuing the pairs comparison, it’s clear here that the USGG3M (red line) has a different shape than the other two lines. This is most pronounced starting around the financial crisis, when the category shows erratic coefficients. In other periods, like the early 2000s, the category is also noticeably out of sync with the other two lines. These relationships were relatively evident in both the bivariate scatterplot and correlation plot, but they also appear well enough here. Intuitively, this makes sense given the underlying data hasn’t changed, only the visualization medium has.
volatility_lm <- melt(coefficients,
variable.name = "bond_category",
value.name = "coefficients")
volatility_lm %>%
filter(bond_category %in% c("USGG3M", "USGG5YR", "USGG30YR")) %>%
mutate(Date = rep(as.Date(rolling_dates[,10]), 3)) %>%
ggplot(aes(Date, coefficients, colour = bond_category)) +
geom_line(size = 1.3, alpha = .2) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
labs(title = "Linear coefficients for USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
y = "Coefficients",
x = "Date",
caption = "Source: Course Project Treasury Data")
Getting a clearer, untangled view of the lines helps here as well. USGG5YR and USGG30YR look broadly similar while USGG3M is noticeably different in places. As such, these visualizations are consistent with the findings from the pairs plot. That said, I think this is a worse format simply because a direct bivariate comparison cannot be made. While the univariate trend lines illuminate each bond category over time more effectively, they are not as well suited for comparative analysis.
volatility_lm %>%
filter(bond_category %in% c("USGG3M", "USGG5YR", "USGG30YR")) %>%
mutate(Date = rep(as.Date(rolling_dates[,10]), 3)) %>%
ggplot(aes(Date, coefficients, colour = bond_category)) +
geom_line(size = 1.3, alpha = .6, show.legend = F) +
facet_wrap(facets = "bond_category", ncol = 1) +
geom_hline(yintercept = 0, colour = "mediumorchid3", alpha = .8, size = 1.3) +
labs(title = "Linear coefficients for USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
y = "Coefficients",
x = "Date",
caption = "Source: Course Project Treasury Data")
As per the analysis, I’ve included the high slope spreads and jump slopes. The first, high slope spreads, highlights major differences (greater than 3) between the five and thirty-year bond slope coefficients. There are 26 such instances in the coefficients set.
coefficients %>%
mutate(Date = as.Date(rolling_dates[,10]),
high_slope_spread = USGG5YR - USGG30YR) %>%
filter(high_slope_spread > 3) %>%
select(Date, high_slope_spread) %>%
custom_kable()
| Date | high_slope_spread |
|---|---|
| 1982-04-26 | 3.253311 |
| 1982-12-15 | 3.026368 |
| 1985-09-16 | 3.097777 |
| 1987-05-12 | 3.026688 |
| 1987-05-19 | 3.542911 |
| 1987-05-27 | 3.450011 |
| 1987-09-25 | 3.103604 |
| 1988-03-15 | 3.065761 |
| 1988-09-27 | 3.112827 |
| 1988-10-05 | 3.122279 |
| 1989-03-10 | 3.701466 |
| 1992-02-05 | 3.216466 |
| 1994-08-03 | 3.029625 |
| 1994-12-08 | 3.602118 |
| 1996-06-14 | 3.376506 |
| 1997-05-09 | 3.054137 |
| 1997-05-16 | 3.063906 |
| 1997-05-23 | 4.009702 |
| 1997-05-30 | 3.197572 |
| 2000-12-26 | 3.768436 |
| 2001-01-02 | 3.293521 |
| 2001-07-25 | 4.175424 |
| 2001-08-01 | 4.308487 |
| 2003-11-13 | 3.244270 |
| 2004-08-12 | 3.176460 |
| 2004-12-16 | 4.199836 |
The jump slope highlights only one instance where USGG5YR was above 3 for the slope coefficient.
coefficients %>%
mutate(Date = as.Date(rolling_dates[,10])) %>%
filter(USGG5YR > 3) %>%
select(Date, USGG5YR) %>%
custom_kable()
| Date | USGG5YR |
|---|---|
| 2004-12-16 | 3.195845 |
As a next step, I’ve put together the rolling R2 values for the high volatility periods. It’s fairly standard work at this point, the only change being that rollapply captures R2 here. Since I already have the rolling dates object, it’s easy to put them into a new data frame with the R2 values.
r_squared <- rollapply(bond_regression[,-1],
width = window_width,
by = window_shift,
by.column = FALSE,
FUN = function(z) summary(lm(Output1 ~ USGG3M + USGG5YR + USGG30YR, data = as.data.frame(z)))$r.squared)
r_squared <- as.data.frame(r_squared) %>%
mutate(Date = as.Date(rolling_dates[,10])) %>%
select(Date, r_squared)
head(r_squared, 10) %>%
custom_kable()
| Date | r_squared |
|---|---|
| 1981-01-16 | 0.9950463 |
| 1981-01-23 | 0.9924859 |
| 1981-01-30 | 0.9986412 |
| 1981-02-06 | 0.9988491 |
| 1981-02-17 | 0.9979588 |
| 1981-02-24 | 0.9964898 |
| 1981-03-03 | 0.9977975 |
| 1981-03-10 | 0.9989634 |
| 1981-03-17 | 0.9987294 |
| 1981-03-24 | 0.9970730 |
The plot highlights that the R2 values are extremely high throughout, but with some notable exceptions. While the mean R2 rests around .99, several windows dip as low as about .82. Why might this be happening?
r_squared %>%
ggplot(aes(Date, r_squared)) +
geom_jitter(alpha = .2, colour = "dodgerblue2") +
geom_hline(yintercept = mean(r_squared$r_squared),
colour = "mediumorchid3", alpha = .8, size = 1.3) +
scale_y_continuous(breaks = seq(.8, 1, .02)) +
labs(title = "R squared for model including USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
subtitle = "Mean R2 value (purple line) shows very high variance explanation by model",
y = "R Squared",
x = "Date",
caption = "Source: Course Project Treasury Data")
To start, a decrease in R2 signals that the model is not explaining as much of the total variance in the output. This would occur at times when the otherwise very strong correlation weakens (R2 being the square of the correlation, R). Below, the dates when R2 slips below .9 are highlighted.
r_squared %>%
filter(r_squared < .9) %>%
custom_kable()
| Date | r_squared |
|---|---|
| 1987-06-24 | 0.8227914 |
| 1991-06-27 | 0.8812476 |
| 2005-04-28 | 0.8801402 |
| 2012-06-20 | 0.8988880 |
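As an aside, the claim that R2 is a transformed correlation can be checked directly: for a least-squares fit with an intercept, R2 equals the squared correlation between fitted and observed values (a minimal simulated sketch, not the project data):

```r
set.seed(2)
x <- rnorm(100)
y <- 1.5 * x + rnorm(100)
fit <- lm(y ~ x)

#multiple R-squared equals the squared correlation of fitted vs observed
all.equal(summary(fit)$r.squared, cor(fitted(fit), y)^2)  #TRUE
```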
Volatility is the first thing that comes to mind as a cause for decreased R2. If some window volatility caused the output to fluctuate, but not in complete concert with the predictors, R2 could reasonably drop. However, this doesn’t really come across in the table below. I’m obviously using a rough eye test of the volatility here, but no major differences pop out.
plot_sd %>%
melt(id.vars = "Date",
variable.name = "bond_category",
value.name = "rolling_sd") %>%
filter(Date %in% c("1987-06-25", "1991-06-28", "2005-04-29", "2012-06-21")) %>%
custom_kable()
| Date | bond_category | rolling_sd |
|---|---|---|
| 1987-06-25 | USGG3M | 0.0579687 |
| 1991-06-28 | USGG3M | 0.0272636 |
| 2005-04-29 | USGG3M | 0.0425541 |
| 2012-06-21 | USGG3M | 0.0070446 |
| 1987-06-25 | USGG5YR | 0.0659105 |
| 1991-06-28 | USGG5YR | 0.0387902 |
| 2005-04-29 | USGG5YR | 0.0655362 |
| 2012-06-21 | USGG5YR | 0.0351445 |
| 1987-06-25 | USGG30YR | 0.0633173 |
| 1991-06-28 | USGG30YR | 0.0357539 |
| 2005-04-29 | USGG30YR | 0.0471661 |
| 2012-06-21 | USGG30YR | 0.0498544 |
| 1987-06-25 | Output1 | 0.1277012 |
| 1991-06-28 | Output1 | 0.0880148 |
| 2005-04-29 | Output1 | 0.1242751 |
| 2012-06-21 | Output1 | 0.0574567 |
Nor is a major volatility difference evident in the plot below, which compares the median volatility across the three bond categories with Output1. Again, this is a rough comparison given I’ve combined the bond volatility, and it does not consider any covariance or slope, but volatility differences don’t seem to be a huge influence.
as.data.frame(rolling_sd) %>%
select(USGG3M, USGG5YR, USGG30YR, Output1) %>%
mutate(sd_apply = apply(rolling_sd[,1:3], 1, median),
Date = plot_sd$Date[1:1656],
low_rsquare = ifelse(Date %in% c("1987-06-25", "1991-06-28",
"2005-04-29", "2012-06-21"),
"low R2 day", "outside scope")) %>%
ggplot(aes(sd_apply, Output1, colour = low_rsquare)) +
geom_jitter() +
labs(title = "Volatility comparison for Output1 and median combination of USGG3M, USGG5YR, and USGG30YR",
subtitle = "Low R2 days don't seem to stick out as major volatilty difference days",
y = "Output1 rolling sd",
caption = "Source: Course Project Treasury Data")
I also thought plotting markers highlighting these low R2 days might be useful. I mentioned there might be days when slope changes in the bond categories and the output don’t align, leading to a lower R2. Each red line highlights one of the lowest R2 dates, which do seem to occur on steep changes in the output. However, there are steeper changes and more obvious disconnects between bond rates and Output1 elsewhere, so this is inconclusive, though still interesting.
bond_trend <- treasury_data %>%
select(-Easing, -Tightening) %>%
melt(value.name = "interest.rate",
variable.name = "treasury.yield",
id.vars = "Date")
bond_trend %>%
ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
geom_line(size = 1.3, alpha = .4) +
scale_y_continuous(breaks = seq(-20, 30, 5)) +
geom_hline(yintercept = 0, colour = "royalblue2", size = 1.3, alpha = .3) +
geom_vline(xintercept = as.Date(c("1987-06-25", "1991-06-28",
"2005-04-29", "2012-06-21")),
colour = "#EE6363", size = 1.3, linetype = "dashed",
alpha = .75) +
guides(colour = guide_legend(override.aes = list(size = 2))) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
labs(title = "Assessing low R2 days using time series plot of all bond classes and output",
subtitle = "Low R2 seem to occur on days with steep changes in output; results are observational though so nothing confirmed",
caption = "Source: Course Project Treasury Data",
y = "interest rate (and unknown output values)")
Another idea for the lowered R2 comes from a broader extrapolation about the predictors and Output1. I’ve already shown that with a full model, the R2 is a perfect 1. With this in mind, it seems reasonable to assume the other four bond categories explain the variance missing here. It could be that these four categories do a better job of explaining the output on these days, and even modest volatility throws off the included three enough to diminish R2. As such, this could simply be a few streaks where these three happen to explain less than usual. Overall, despite not having a perfect causal explanation, there are at least some instructive theories to work with.
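The extrapolation about the full model can be illustrated on simulated data: when the output is an exact linear combination of the inputs, the full model returns R2 = 1 while any subset falls short (a sketch with hypothetical inputs, not the project data):

```r
set.seed(3)
n <- 200
X <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, paste0("x", 1:4)))
out <- X %*% c(.5, 1, 2, .25)  #exact combination of all four inputs
df  <- data.frame(X, out = out)

summary(lm(out ~ x1 + x2 + x3 + x4, data = df))$r.squared  #1 (to machine precision)
summary(lm(out ~ x1 + x2 + x3, data = df))$r.squared       #less than 1
```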
The rolling p-values are again obtained using rollapply. These p-values are from the slope tests for each bond category and the intercept test for Output1. Given this, they signify whether each variable is significant in the wider three-term model and whether the intercept coefficient is significant.
p_values <- rollapply(bond_regression[,-1],
width = window_width,
by = window_shift,
by.column = FALSE,
FUN = function(z) summary(lm(Output1 ~ USGG3M + USGG5YR + USGG30YR, data = as.data.frame(z)))$coefficients[,4])
p_values <- as.data.frame(p_values) %>%
mutate(Date = as.Date(rolling_dates[,10])) %>%
rename(Intercept = "(Intercept)")
p_values %>%
select(Date, Intercept, USGG3M, USGG5YR, USGG30YR) %>%
head(10) %>%
custom_kable()
| Date | Intercept | USGG3M | USGG5YR | USGG30YR |
|---|---|---|---|---|
| 1981-01-16 | 0 | 0 | 0.0000002 | 0.2538852 |
| 1981-01-23 | 0 | 0 | 0.0000002 | 0.0053009 |
| 1981-01-30 | 0 | 0 | 0.0000000 | 0.0000363 |
| 1981-02-06 | 0 | 0 | 0.0000000 | 0.0000222 |
| 1981-02-17 | 0 | 0 | 0.0000000 | 0.0001332 |
| 1981-02-24 | 0 | 0 | 0.0000009 | 0.0047335 |
| 1981-03-03 | 0 | 0 | 0.0000034 | 0.0601047 |
| 1981-03-10 | 0 | 0 | 0.0000168 | 0.3851840 |
| 1981-03-17 | 0 | 0 | 0.0000146 | 0.5025726 |
| 1981-03-24 | 0 | 0 | 0.0006413 | 0.3052705 |
Plotting these values highlights which predictor slope coefficients are significant. Perhaps more telling, it clearly shows which inputs have the most non-significant coefficients. At a glance, it looks like the USGG30YR term has the most non-significant p-values by a large margin. Beyond that, USGG5YR appears to have the most consistently significant coefficients. It also looks like the intercept is always significant.
At first glance, it might be surprising to have so many non-significant predictors come up in a model that is almost certainly always significant (judging by the high R2s and the linear model summaries from previous sections). Again though, the effects of multicollinearity likely influence this result.
plot_pvalue <- melt(p_values,
id.vars = "Date",
variable.name = "bond_category",
value.name = "rolling_pvalue")
plot_pvalue %>%
ggplot(aes(Date, rolling_pvalue, colour = bond_category, shape = bond_category)) +
geom_jitter(size = 4, alpha = .75) +
geom_hline(yintercept = .05, colour = "darkorange", alpha = .3, size = 1.5) +
scale_shape_manual(values = 49:52) +
guides(alpha = guide_legend(override.aes = list(alpha = 1))) +
labs(title = "Slope and intercept p-values for model containing Output1 ~ USGG3M, USGG5YR, and USGG30YR",
subtitle = "USGG30YR appears to have the most non-significant slope coefficients in rolling lm (orange line is .05 p-value threshold)",
caption = "Source: Course Project Treasury Data")
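The multicollinearity effect flagged above is easy to reproduce: two nearly identical predictors split the signal between them, inflating their standard errors and p-values even though each predicts well on its own (a simulated sketch, not the project data):

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = .05)  #nearly collinear with x1
y  <- x1 + rnorm(n, sd = .5)

#alone, x2 is a strong predictor; next to x1, its p-value balloons
summary(lm(y ~ x2))$coefficients["x2", 4]       #tiny p-value
summary(lm(y ~ x1 + x2))$coefficients["x2", 4]  #much larger p-value
```

This mirrors the rolling models: a non-significant USGG30YR slope doesn’t mean the yield is unrelated to Output1, only that it adds little beyond its collinear peers in that window.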
Confirming the previous intuition stemming from the p-value plot, the table below shows that USGG30YR had 773 p-values over the significance threshold of .05 (about 47% of all samples), which was by far the most of all the predictors.
plot_pvalue %>%
group_by(bond_category) %>%
filter(rolling_pvalue > .05) %>%
count() %>%
summarise(non_significant_n = n,
non_significant_percent = round(n / 1657 * 100)) %>%
arrange(desc(non_significant_n)) %>%
custom_kable()
| bond_category | non_significant_n | non_significant_percent |
|---|---|---|
| USGG30YR | 773 | 47 |
| USGG3M | 156 | 9 |
| USGG5YR | 32 | 2 |
The following few chunks also highlight what dates these high p-values correspond to using a .5 threshold. This includes 312 dates total, which I’ve truncated to include five samples from each bond category (for space saving purposes). All samples can be seen by removing the slice function.
plot_pvalue %>%
group_by(bond_category) %>%
filter(rolling_pvalue > .5) %>%
select(Date, rolling_pvalue) %>%
arrange(desc(rolling_pvalue)) %>%
slice(1:5) %>%
custom_kable()
| bond_category | Date | rolling_pvalue |
|---|---|---|
| USGG3M | 1992-07-15 | 0.9728457 |
| USGG3M | 2002-05-23 | 0.9558409 |
| USGG3M | 2009-10-13 | 0.9155319 |
| USGG3M | 2000-11-07 | 0.8972432 |
| USGG3M | 2011-11-23 | 0.8940151 |
| USGG5YR | 1987-09-11 | 0.9764203 |
| USGG5YR | 1987-06-24 | 0.9277380 |
| USGG5YR | 1982-12-01 | 0.9113957 |
| USGG5YR | 1987-09-03 | 0.8060915 |
| USGG5YR | 1999-12-03 | 0.7327210 |
| USGG30YR | 1999-06-25 | 0.9992457 |
| USGG30YR | 1994-03-16 | 0.9985098 |
| USGG30YR | 2001-07-11 | 0.9983685 |
| USGG30YR | 2007-11-16 | 0.9956528 |
| USGG30YR | 1996-12-06 | 0.9920443 |
The final section of the project deals with Principal Component Analysis (PCA). This statistical method breaks down predictors into new variables that are orthogonal and thereby uncorrelated. Given the analysis has shown how highly correlated the bond categories are, PCA seems ideal for this data set.
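That orthogonality property is easy to demonstrate: even for heavily correlated inputs, the principal component scores have pairwise correlations of (numerically) zero (a sketch with simulated data):

```r
set.seed(6)
base <- rnorm(500)
correlated <- cbind(a = base + rnorm(500, sd = .1),
b = base + rnorm(500, sd = .1),
c = base + rnorm(500, sd = .1))

#pairwise input correlations are near 1, but PC scores are uncorrelated
round(cor(correlated), 2)
round(cor(prcomp(correlated)$x), 10)  #identity matrix
```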
Beginning the analysis, I’ve created a PCA specific set that only includes the bond category predictors. PCA is not performed on the outcome variable, so Output1 is omitted.
pca_set <- treasury_data %>%
select(USGG3M, USGG6M, USGG2YR, USGG3YR, USGG5YR, USGG10YR, USGG30YR)
c(pca_rows = dim(pca_set)[1], pca_columns = dim(pca_set)[2]) %>%
custom_kable()
| pca_rows | 8300 |
| pca_columns | 7 |
Taking a quick look at the new data frame, the PCA set can be seen below.
head(pca_set) %>%
custom_kable()
| USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR |
|---|---|---|---|---|---|---|
| 13.52 | 13.09 | 12.289 | 12.28 | 12.294 | 12.152 | 11.672 |
| 13.58 | 13.16 | 12.429 | 12.31 | 12.214 | 12.112 | 11.672 |
| 14.50 | 13.90 | 12.929 | 12.78 | 12.614 | 12.382 | 11.892 |
| 14.76 | 14.00 | 13.099 | 12.95 | 12.684 | 12.352 | 11.912 |
| 15.20 | 14.30 | 13.539 | 13.28 | 12.884 | 12.572 | 12.132 |
| 15.22 | 14.23 | 13.179 | 12.94 | 12.714 | 12.452 | 12.082 |
To reaffirm the extreme correlation between predictors, I put together a pairwise plot below. As seen, the predictors are highly, and almost perfectly, correlated.
pca_set %>%
select(USGG3M, USGG2YR, USGG5YR) %>%
ggpairs(title = "Pairwise plot for USGG3M, USGG2YR, and USGG5YR",
lower = list(continuous = regression_pairs))
Code for a 3D representation of this plot can be found below (although it will not produce an output in the R markdown file).
pca_set %>%
select(USGG3M, USGG2YR, USGG5YR) %>%
rgl.points()
The covariance matrix is an essential component of PCA. The technique relies on getting the eigenvalues of this matrix, which give the amount of variance each principal component explains (generally scaled to a percentage). At the same time, PCA factor scores are derived by multiplying the centred original data set (i.e. each column has its mean subtracted) by the resulting eigenvector matrix. Additionally, the eigenvector matrix comprises the PCA loadings. Given this, a necessary starting point is developing this covariance object.
While R has a built-in function to get this matrix, cov, I’ve developed my own function to illuminate how the process works. The covariance matrix is found by centring each variable in the data set, which in this case means taking the average yield rate by bond category and subtracting it from each rate value. From there, the new centred matrix is multiplied against itself. For this to work, the first matrix term has to be transposed (7 x 8300 * 8300 x 7). The resulting square, symmetric matrix (7 x 7) is the unnormalized covariance matrix. To perform the normalization, each matrix element is divided by the number of rows in the PCA set minus one (accounting for degrees of freedom). Following this, the covariance matrix is fully constructed. The function I developed, manual_covariance, follows these steps as seen in the chunk below.
manual_covariance <- function(dataset){
centred_matrix <- as.matrix(scale(dataset, scale = F))
covariance_matrix <- t(centred_matrix) %*% centred_matrix
covariance_matrix <- covariance_matrix / (nrow(dataset) - 1)
custom_kable(covariance_matrix)
}
manual_covariance(pca_set)
| USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR | |
|---|---|---|---|---|---|---|---|
| USGG3M | 11.760393 | 11.855287 | 12.303031 | 11.942035 | 11.188856 | 9.924865 | 8.587987 |
| USGG6M | 11.855287 | 12.000510 | 12.512434 | 12.158422 | 11.406959 | 10.128890 | 8.768702 |
| USGG2YR | 12.303031 | 12.512434 | 13.284203 | 12.977542 | 12.279514 | 11.005377 | 9.600181 |
| USGG3YR | 11.942035 | 12.158422 | 12.977542 | 12.708647 | 12.068078 | 10.856033 | 9.497246 |
| USGG5YR | 11.188856 | 11.406959 | 12.279514 | 12.068078 | 11.543082 | 10.463386 | 9.212159 |
| USGG10YR | 9.924865 | 10.128890 | 11.005377 | 10.856033 | 10.463386 | 9.583483 | 8.510632 |
| USGG30YR | 8.587987 | 8.768702 | 9.600181 | 9.497246 | 9.212159 | 8.510632 | 7.624304 |
To confirm that the new function returns the correct matrix, I’ve checked it against the cov function. As seen, the results are the same.
covariance_matrix <- as.data.frame(cov(pca_set))
covariance_matrix %>%
custom_kable()
| USGG3M | USGG6M | USGG2YR | USGG3YR | USGG5YR | USGG10YR | USGG30YR | |
|---|---|---|---|---|---|---|---|
| USGG3M | 11.760393 | 11.855287 | 12.303031 | 11.942035 | 11.188856 | 9.924865 | 8.587987 |
| USGG6M | 11.855287 | 12.000510 | 12.512434 | 12.158422 | 11.406959 | 10.128890 | 8.768702 |
| USGG2YR | 12.303031 | 12.512434 | 13.284203 | 12.977542 | 12.279514 | 11.005377 | 9.600181 |
| USGG3YR | 11.942035 | 12.158422 | 12.977542 | 12.708647 | 12.068078 | 10.856033 | 9.497246 |
| USGG5YR | 11.188856 | 11.406959 | 12.279514 | 12.068078 | 11.543082 | 10.463386 | 9.212159 |
| USGG10YR | 9.924865 | 10.128890 | 11.005377 | 10.856033 | 10.463386 | 9.583483 | 8.510632 |
| USGG30YR | 8.587987 | 8.768702 | 9.600181 | 9.497246 | 9.212159 | 8.510632 | 7.624304 |
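Beyond eyeballing the two tables, the equivalence can be confirmed programmatically. A minimal sketch on a small synthetic matrix (the project data isn’t reproduced here), using the same centre-multiply-normalize steps as manual_covariance:

```r
# Verify the manual covariance construction against stats::cov
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)

manual_cov <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
  (t(Xc) %*% Xc) / (nrow(X) - 1)                 # normalize by n - 1
}

all.equal(manual_cov(X), cov(X), check.attributes = FALSE)  # TRUE
```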
The covariance matrix can be visualized using a slightly modified contour plot. Normally, a contour plot requires a numeric x and y-axis. To relax this constraint, each bond category can be turned into a number, with USGG3M and USGG6M becoming decimals (.25 and .5 for a quarter and half year). From there, the plot can be constructed.
I’ve added in dots to signal where each bond class value meets with another in the covariance matrix. For example, the USGG30YR and USGG30YR point in the matrix is 7.624, which is labelled in the plot’s far right corner. A contour plot normally has a third variable to interpret (z) but, since the both the x and y-axis are the same values, there isn’t any actually density to review. That said, it’s still an effective way to visualize the covariance matrix.
contour_plot <- covariance_matrix %>%
mutate(bond = c(.25, .5, 2, 3, 5, 10, 30))
contour_plot <- melt(contour_plot, id.vars = "bond")
contour_plot %>%
mutate(value = round(value, 3),
bond = as.numeric(bond),
y = rep(c(.25, .5, 2, 3, 5, 10, 30), each = 7),
label_marker = ifelse(value < 8 | value == 9.583 | value == 11.543, T, F )) %>%
ggplot(aes(bond, y, z = value, label = ifelse(label_marker, value, NA))) +
geom_contour(size = 1.3) +
geom_point(size = 2, colour = "darkorange") +
geom_label() +
scale_x_continuous(breaks = seq(0, 30, 5)) +
scale_y_continuous(breaks = seq(0, 30, 5)) +
labs(title = "Contour plot for bond categories",
subtitle = "Each orange dot represents the covariance intersect of a bond category",
x = "bond category",
y = "bond category",
caption = "Source: Course Project Treasury Data")
I’ve already described this process but I’ll briefly reiterate it. To derive the principal component factors, loadings, and factor scores, a series of matrix multiplication operations take place. This starts with a centred matrix, which is the PCA set with each column mean subtracted (Y naught). This is used to find a covariance matrix between all the bond category yield rates. Thereafter, the covariance matrix is used to find eigenvalues and eigenvectors. The decomposition is possible here because the covariance matrix has full rank and is symmetric. I’ve also included the factor matrix (F), which is the centred matrix multiplied by the eigenvector matrix, producing the factor scores.
The initial calculations include all seven factors, which I’ll reduce to factors one, two, and three given they are the analysis focus.
centred_matrix <- as.matrix(scale(pca_set, scale = F))
means_vector <- colMeans(pca_set)
eigen_values <- eigen(covariance_matrix)$values
eigen_vectors <- eigen(covariance_matrix)$vectors
pca_factors <- data.frame(
Date = treasury_data$Date,
centred_matrix %*% eigen_vectors) %>%
select(Date, X1, X2, X3) %>%
rename(F1 = X1,
F2 = X2,
F3 = X3)
As always with a manual process, it’s good to check if everything went smoothly and the correct values were derived. To set up a comparison, I’ll use the princomp function, which automates PCA. I’ll check the eigenvector matrix, which holds the loadings, against the one I developed. A comparison of the charts shows that they are the same up to sign (some components are flipped, which is expected since eigenvectors are only defined up to sign), which validates the process. I’ll provide a more robust comparison in a later section but, this is a good starting point since it validates the values I derived manually. With this in mind, I’ll use my variables going forward.
bond_pca <- princomp(pca_set)
bond_pca$loadings[,1:7] %>%
custom_kable()
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | |
|---|---|---|---|---|---|---|---|
| USGG3M | -0.3839609 | 0.5074451 | 0.5298222 | -0.4037350 | 0.3860878 | -0.0397629 | 0.0267425 |
| USGG6M | -0.3901870 | 0.4394614 | 0.1114737 | 0.4052645 | -0.6787624 | 0.0947545 | -0.0909135 |
| USGG2YR | -0.4151851 | 0.1111272 | -0.4187873 | 0.4089695 | 0.3787209 | -0.2984864 | 0.4900099 |
| USGG3YR | -0.4063541 | -0.0169699 | -0.4476561 | -0.0643375 | 0.2362448 | 0.1976003 | -0.7315706 |
| USGG5YR | -0.3860610 | -0.2314032 | -0.2462364 | -0.5335766 | -0.2868460 | 0.4212577 | 0.4385596 |
| USGG10YR | -0.3477544 | -0.4324598 | 0.1500903 | -0.1985654 | -0.2562426 | -0.7356186 | -0.1526275 |
| USGG30YR | -0.3047124 | -0.5442123 | 0.4979195 | 0.4209884 | 0.2074508 | 0.3777669 | 0.0091998 |
rownames(eigen_vectors) <- bond_categories
colnames(eigen_vectors) <- c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5",
"Comp.6", "Comp.7")
eigen_vectors %>%
custom_kable()
| Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | |
|---|---|---|---|---|---|---|---|
| USGG3M | -0.3839609 | -0.5074451 | 0.5298222 | 0.4037350 | 0.3860878 | -0.0397629 | 0.0267425 |
| USGG6M | -0.3901870 | -0.4394614 | 0.1114737 | -0.4052645 | -0.6787624 | 0.0947545 | -0.0909135 |
| USGG2YR | -0.4151851 | -0.1111272 | -0.4187873 | -0.4089695 | 0.3787209 | -0.2984864 | 0.4900099 |
| USGG3YR | -0.4063541 | 0.0169699 | -0.4476561 | 0.0643375 | 0.2362448 | 0.1976003 | -0.7315706 |
| USGG5YR | -0.3860610 | 0.2314032 | -0.2462364 | 0.5335766 | -0.2868460 | 0.4212577 | 0.4385596 |
| USGG10YR | -0.3477544 | 0.4324598 | 0.1500903 | 0.1985654 | -0.2562426 | -0.7356186 | -0.1526275 |
| USGG30YR | -0.3047124 | 0.5442123 | 0.4979195 | -0.4209884 | 0.2074508 | 0.3777669 | 0.0091998 |
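The sign flips visible between the two tables can be handled by comparing absolute values, since eigenvectors are only defined up to sign. A quick sketch of this comparison on synthetic data:

```r
# Manual eigenvectors vs princomp loadings, compared on absolute values
set.seed(2)
X <- matrix(rnorm(60), nrow = 15, ncol = 4)
manual_vecs <- eigen(cov(X))$vectors         # columns ordered by eigenvalue
pc_load <- unclass(princomp(X)$loadings)     # same vectors, possibly sign-flipped
all.equal(abs(manual_vecs), abs(pc_load), check.attributes = FALSE)  # TRUE
```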
With the PCA process confirmed, I’ll work towards conducting the factor and loadings review. Here, the real power of conducting a PCA analysis will become clear. To begin, there are seven factors to review, which are the eigenvalues from the previous step.
Eigenvalues here are a proxy for how much variance the analysis captures. These values, or factors, are essentially all the information from the original data set distilled into 7 values. They come ordered so the largest factor is first with decreasing importance thereafter. Each can be reviewed for how much variance it explains by dividing the eigenvalue by the sum of all factors.
The variance explanations can then be charted, as seen below. Factor one captures about 98% of information from all the bond categories. This is fairly astounding but, given how highly correlated each predictor is, not totally unexpected. The power of PCA can be seen firsthand here: In essence, an entire set was compressed into seven factors, the first of which explains about 98% of the entire variance, or information, found in the data. That means that the remaining 2% is spread between the other six factors and, even then, mostly in factor two. As a whole, factors one and two more or less describe all seven predictors. The remaining five have very little information contained in them but, do contribute some marginal variance explanation.
as.data.frame(eigen_values) %>%
mutate(factors = c("F1", "F2", "F3", "F4", "F5", "F6", "F7"),
variance_explained = round(eigen_values / sum(eigen_values), 2),
factors = reorder(factors, variance_explained)) %>%
ggplot(aes(factors, variance_explained,
label = paste0(variance_explained * 100,"%"))) +
geom_col(fill = "royalblue2") +
coord_flip() +
geom_label(hjust = .3, nudge_x = 0.05) +
labs(title = "PCA variance plot- Factor 1 provides most variance explanation (98%)",
caption = "Source: Course Project Treasury Data")
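The proportion calculation behind the chart is just each eigenvalue over the eigenvalue total, and the resulting shares must sum to one. A compact sketch on synthetic data:

```r
# Variance explained per factor from the covariance eigenvalues
set.seed(3)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)
ev <- eigen(cov(X))$values          # ordered largest first
proportions <- ev / sum(ev)         # share of total variance per factor
round(proportions, 3)
```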
Next are the loadings. As aforementioned, the focus here is on factors one through three, so I’ve created a data frame with just those for analysis. The faceted plot depicts the original loadings alongside a transformed version where F1 has been multiplied by -1 (which switches each value’s sign).
Starting with F1 (red line), it’s clear that the factor is evenly represented across all bond categories, save for some minor fluctuations. F2 (green line) displays heavy negative weightings on some of the short-term bonds, such as USGG3M and USGG6M, but rises steadily toward the long-term bonds. F3 (blue line) forms a “U” shape, with higher loadings in the short and long-term bonds and decreasing values in the middle-term categories. F2 and F3 are fairly erratic and cross zero, whereas F1 is much more stable.
That more or less just describes their movement and shape though. Focusing on a wider interpretation here, there is further meaning to be described. Starting at USGG3M, F3 is the highest loading value with F1 and F2 much further down. As such, F3 can be thought of as capturing the most information from the bond class. Granted, this is probably noisy information given the factor only accounts for a small part of the data set information but, this at least adds an intuitive dimension to the analysis. F1 captures most categories evenly, which again makes sense given it accounts for 98% of the data’s variance.
There’s still a lesson here though: The lower variance factors do capture information on specific bond classes that F1 might not fully account for. F2 seems to have higher loadings than F1 and F3 for USGG3YR; while this might be noise, there could be some valuable outlier information here. In any case, thinking about the nuance of each factor’s loadings helps get the most out of the PCA analysis.
As a final macro point, all the loadings can be used to describe the day-to-day movement of the bond yields. Since the first factor captures the most information, its movement can be used to describe most yield movement. Since the loadings don’t change, they become a marker for rate change and could be applied on any day, in combination with the factor scores, to recover the daily yields.
pca_loadings <- data.frame(
eigen(covariance_matrix)$vectors) %>%
select(X1, X2, X3) %>%
rename(F1 = X1,
F2 = X2,
F3 = X3)
pca_loadings <- melt(pca_loadings,
variable.name = "factor",
value.name = "pca_loadings")
pca_loadings <- pca_loadings %>%
mutate(pca_loadings = ifelse(factor == "F1", pca_loadings * -1, pca_loadings)) %>%
bind_rows(pca_loadings) %>%
mutate(loading_class = ifelse(seq_along(pca_loadings) <= 21, "F1 * -1", "original"),
loading_class = factor(loading_class, levels = c("original", "F1 * -1"))) %>%
select(factor, loading_class, pca_loadings)
pca_loadings %>%
mutate(bond_category = rep(factor(bond_categories, levels = bond_categories), 6)) %>%
ggplot(aes(bond_category, pca_loadings, colour = factor, group = factor)) +
geom_line(size = 1.5) +
facet_wrap(facets = "loading_class", nrow = 2) +
scale_y_continuous(breaks = seq(-1, 1, .3)) +
labs(title = "PCA loadings by factor across all bond categories",
subtitle = "F1 even across all bonds, F2 shows negative loadings for short-term classes, F3 is low for medium-term bonds (original interpretations)",
caption = "Source: Course Project Treasury Data")
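The claim that the loadings can recover daily yields is worth making concrete: multiplying the factor scores by the transposed eigenvector matrix and adding back the column means rebuilds the original data exactly. A sketch on synthetic data:

```r
# Full reconstruction: X = F %*% t(V) + column means
set.seed(4)
X <- matrix(rnorm(60), nrow = 15, ncol = 4)
V <- eigen(cov(X))$vectors
F_scores <- scale(X, scale = FALSE) %*% V           # centred data times eigenvectors
X_rebuilt <- sweep(F_scores %*% t(V), 2, colMeans(X), "+")
all.equal(X, X_rebuilt, check.attributes = FALSE)   # TRUE
```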
When assessing the factor scores, a familiar line is evident. In the top facet, factor one has been multiplied by -1 changing the score signs. Following this transformation, the line shows a strong resemblance to Output1. I’ll come back to this in more depth shortly but, the output mystery is finally uncovered here.
In terms of interpretation, since factor one captures almost all of the bond category movement history, its movement can be read as a history of yield rate changes since 1981. In the early 1980s, all bond classes had very high yields relative to the rest of the sample. During this time, the F1 scores are at their highest. All of the previous analyses can be seen here too, such as the easing and tightening periods. Additionally, the big financial events, such as the 2008 crisis, are evident as well. To offer a colloquial summation: As F1 goes, so too go Treasury yield rates.
While F2 and F3 are fairly stable in comparison, they capture different bond information and can be thought of as sequential histories as well. Additionally, factor two captures noise that might be useful and not covered by F1. For example, around sample 2800, F2 rises slightly at a time when F1 is decreasing. Places where the lines are incongruent might offer insights that F1 omits but another factor captures. Each factor also has a connection to the rates, though not nearly as strong or defined as F1.
factor_scores <- melt(pca_factors[,-1],
variable.name = "factor",
value.name = "factor_scores")
factor_scores <- factor_scores %>%
mutate(factor_scores = ifelse(factor == "F1", factor_scores * -1, factor_scores)) %>%
bind_rows(factor_scores)
factor_scores %>%
mutate(loading_class = rep(c("F1 * -1", "original"), each = 24900),
loading_class = factor(loading_class, levels = c("F1 * -1", "original")),
index = rep(seq(pca_factors$F1), 6)) %>%
ggplot(aes(index, factor_scores, colour = factor, group = factor)) +
scale_x_continuous(breaks = seq(0, 8500, 1000)) +
geom_line(size = 1.5) +
facet_wrap(facets = "loading_class", nrow = 2, scales = "free_y") +
labs(title = "PCA factor scores across all samples",
caption = "Source: Course Project Treasury Data")
To start, the shape of the plot reveals the orthogonal nature of each factor. By definition, these are linearly independent and as such, do not have any overlapping information. The result of this is a correlation of zero between F1 and F2 (1). The shape also highlights that F1 is leading F2, given the direction of the plot and the interaction of the points. This makes more sense from a time series perspective, which the underlying plot also contains (2). To make this clearer, I’ve added in a colour for each decade in the sample so the history of the two factors changing can be better reviewed.
The spacing of points across each decade is revealing as well. For example, the spacing is more distant and sweeping in the 1980s, a period marked by large swings in day-to-day yield rates. In contrast, the factor scores in the 2010s are spaced much closer together. When reviewing the yield rates during that time, they appear very stable without any major shocks (3). Given this, the bivariate F1-F2 plot, much like the faceted factor score visualization, provides insight into the history of rate changes since 1981 (4). To highlight this, and to assist with audience comprehension, I’ve included date markers to help interpret the plot. Since it moves from right to left, which is traditionally not how time series are viewed, the date markers help guide the viewer.
pca_factors <- pca_factors %>%
mutate(F1 = F1 * -1,
decade.marker = substr(as.character(Date), 1, 3),
decade = case_when(
decade.marker == "198" ~ "1980s",
decade.marker == "199" ~ "1990s",
decade.marker == "200" ~ "2000s",
decade.marker == "201" ~ "2010s")) %>%
select(Date, decade, F1, F2, F3)
pca_factors %>%
mutate(date_indicate = case_when(
Date == "1981-01-05" ~ "Start here: 1981",
Date == "2014-06-26" ~ "End here: 2014")) %>%
ggplot(aes(F1, F2, colour = decade, label = date_indicate)) +
geom_point(size = 3, alpha = .3) +
geom_label(hjust = .3, nudge_x = -5.5, nudge_y = -.5, fontface = "bold") +
scale_y_continuous(breaks = seq(-5, 5, 1)) +
labs(title = "PCA factor 1 vs 2 over time from 1981 to 2014 (right to left)",
subtitle = "Factor spacings indicate different yield volatility across decades, factors are orthogonal (R = 0), F1 leading F2",
caption = "Source: Course Project Treasury Data")
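The zero correlation noted above follows from the orthogonality of the eigenvectors, and it holds for any pair of factors, not just F1 and F2. A short check on synthetic data:

```r
# Off-diagonal correlations between factor scores are (numerically) zero
set.seed(5)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
scores <- scale(X, scale = FALSE) %*% eigen(cov(X))$vectors
C <- cor(scores)
max(abs(C[upper.tri(C)]))   # effectively zero, up to floating-point error
```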
The factor adjustments shape the yield term curves in different ways. I’ve included a faceted plot comparing both term curves under individual and combined factor influence, which makes it easier to see how each factor shapes the curve. Overall, F1 provides a parallel shift of the line, F2 changes its slope, and F3 affects the middle curvature of the term curve.
pca_loadings <- data.frame(
eigen(covariance_matrix)$vectors) %>%
select(X1, X2, X3) %>%
mutate(X1 = X1 * -1)
old_curve <- treasury_data[135,2:8]
new_curve <- treasury_data[136,2:8]
factor_change <- pca_factors[136,3:5] - pca_factors[135,3:5]
curve_change <- old_curve - new_curve
F1_adj <- old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1]))
F2_adj <- old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1])) +
(t(pca_loadings[,2]) * as.numeric(factor_change[2]))
F3_adj <- old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1])) +
(t(pca_loadings[,2]) * as.numeric(factor_change[2])) +
(t(pca_loadings[,3]) * as.numeric(factor_change[3]))
F2 <- old_curve + (t(pca_loadings[,2]) * as.numeric(factor_change[2]))
F3 <- old_curve + (t(pca_loadings[,3]) * as.numeric(factor_change[3]))
curve_adjustment <- bind_rows(old_curve, new_curve, F1_adj, F2_adj, F3_adj,
old_curve, new_curve, F1_adj, F2, F3)
curve_adjustment <- melt(curve_adjustment,
variable.name = "bond_category",
value.name = "curve_adjustment")
curves <- c("Old Curve", "New Curve", "1-Factor Adj.",
"2-Factor Adj.", "3-Factor Adj.")
adjustment <- c("All factors", "Individual factors")
curve_adjustment <- curve_adjustment %>%
mutate(adjustment = factor(rep(adjustment, each = 5, 7), levels = adjustment),
curve = factor(rep(curves, 14), levels = curves))
curve_adjustment %>%
ggplot(aes(bond_category, curve_adjustment, colour = curve, group = curve)) +
geom_line(size = 1.5, alpha = .4) +
facet_wrap(facets = "adjustment", nrow = 2) +
scale_y_continuous(breaks = seq(10, 18, .5)) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
labs(title = "Yields rates by category with adjusted term curves for individual and combined factors",
subtitle = "F1 provides parallel shift, F2 provides change of slope, F3 changes middle curvature",
caption = "Source: Course Project Treasury Data")
The USGG10YR approximations are very close to the actual values. They are derived using the mean value for USGG10YR (taken from the vector of means) plus the bond category’s first three loadings multiplied by the corresponding three factor scores.
ten_year <- data.frame(
pca_approx = means_vector[6] +
pca_loadings[6,1] *
pca_factors[,3] +
pca_loadings[6,2] *
pca_factors[,4] +
pca_loadings[6,3] *
pca_factors[,5],
USGG10YR = treasury_data$USGG10YR
)
ten_year %>%
slice(1:10) %>%
custom_kable()
| pca_approx | USGG10YR |
|---|---|
| 11.99677 | 12.152 |
| 11.97187 | 12.112 |
| 12.25286 | 12.382 |
| 12.27920 | 12.352 |
| 12.51041 | 12.572 |
| 12.37916 | 12.452 |
| 12.41123 | 12.532 |
| 12.37217 | 12.532 |
| 12.53218 | 12.622 |
| 12.41604 | 12.532 |
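This approximation is a truncated version of the full reconstruction: keep only the first k columns of the loadings and factor scores. A sketch with strongly correlated synthetic columns, where even a single factor gets close:

```r
# Rank-k approximation: first k factor scores and loadings only
set.seed(6)
z <- rnorm(50)
X <- cbind(z + rnorm(50, sd = .1),
           z + rnorm(50, sd = .1),
           z + rnorm(50, sd = .1))        # three highly correlated columns
V <- eigen(cov(X))$vectors
F_scores <- scale(X, scale = FALSE) %*% V
k <- 1
X_approx <- sweep(F_scores[, 1:k, drop = FALSE] %*% t(V[, 1:k, drop = FALSE]),
                  2, colMeans(X), "+")
mean(abs(X_approx - X))   # small: one factor captures the shared movement
```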
Two metrics provide insight into the goodness of fit. Mean absolute error (MAE) indicates how large the average miss between the approximated and actual values is; here, the average miss was only about .038. Mean absolute percentage error (MAPE) is very similar but expresses the miss as a percentage of the actual value; here, the approximations were, on average, about .757% off from the actuals. Again, this points to a very close fit between the approximations and actuals.
ten_year %>%
summarise(MAE = round(mean(abs(pca_approx - USGG10YR)), 3),
MAPE = round(mean(MAE / USGG10YR * 100), 3)) %>%
custom_kable()
| MAE | MAPE |
|---|---|
| 0.038 | 0.757 |
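Both metrics follow directly from their definitions (MAPE here computed from the per-observation absolute percentage error). A toy illustration, not the project data:

```r
# MAE and MAPE on toy actual/approximated values
actual <- c(10, 12, 11, 13)
approx <- c(10.1, 11.9, 11.2, 12.9)
MAE  <- mean(abs(approx - actual))                  # average absolute miss
MAPE <- mean(abs(approx - actual) / actual) * 100   # average percentage miss
c(MAE = MAE, MAPE = round(MAPE, 3))
```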
The plot of both lines is, unsurprisingly, indicative of a very close fit as well. In fact, both lines look nearly identical, which is expected given the average miss is so low.
ten_year <- melt(ten_year,
variable.name = "value_composition",
value.name = "USGG10YR")
ten_year %>%
mutate(index = rep(seq(treasury_data$USGG10YR), 2)) %>%
ggplot(aes(index, USGG10YR, colour = value_composition)) +
geom_line(aes(size = value_composition), alpha = .65) +
scale_size_manual(values = c(2, 1)) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
scale_colour_manual(values = c("darkorange", "royalblue2")) +
labs(title = "Actual USGG10YR yield vs PCA approximated numbers- values are very closely aligned",
caption = "Source: Course Project Treasury Data")
Checking the manual process with the princomp function
I began this comparison earlier because I wanted to vet my manual values before using them in this section of the analysis. However, I only provided a cursory check, so I’ll do a more thorough review here. I’ve already developed the PCA object for the predictors, so in the first chunk I’m just seeing what names are associated with it.
c(pca_names = names(bond_pca)) %>%
custom_kable()
| pca_names1 | sdev |
| pca_names2 | loadings |
| pca_names3 | center |
| pca_names4 | scale |
| pca_names5 | n.obs |
| pca_names6 | scores |
| pca_names7 | call |
The eigenvector matrix and loadings were done in my first comparison but, to reiterate here using the first three factors, the values are the same, save for sign differences in factor two.
data.frame(
bond_pca$loadings[,1:3],
eigen_vectors[,1:3]) %>%
custom_kable()
| Comp.1 | Comp.2 | Comp.3 | Comp.1.1 | Comp.2.1 | Comp.3.1 | |
|---|---|---|---|---|---|---|
| USGG3M | -0.3839609 | 0.5074451 | 0.5298222 | -0.3839609 | -0.5074451 | 0.5298222 |
| USGG6M | -0.3901870 | 0.4394614 | 0.1114737 | -0.3901870 | -0.4394614 | 0.1114737 |
| USGG2YR | -0.4151851 | 0.1111272 | -0.4187873 | -0.4151851 | -0.1111272 | -0.4187873 |
| USGG3YR | -0.4063541 | -0.0169699 | -0.4476561 | -0.4063541 | 0.0169699 | -0.4476561 |
| USGG5YR | -0.3860610 | -0.2314032 | -0.2462364 | -0.3860610 | 0.2314032 | -0.2462364 |
| USGG10YR | -0.3477544 | -0.4324598 | 0.1500903 | -0.3477544 | 0.4324598 | 0.1500903 |
| USGG30YR | -0.3047124 | -0.5442123 | 0.4979195 | -0.3047124 | 0.5442123 | 0.4979195 |
In the same vein, the loadings plot is the same as well.
pca_loadings <- melt(bond_pca$loadings[1:7,1:3],
variable.name = "factor",
value.name = "pca_loadings")
pca_loadings <- pca_loadings %>%
mutate(pca_loadings = ifelse(Var2 == "Comp.1", pca_loadings * -1, pca_loadings)) %>%
bind_rows(pca_loadings) %>%
mutate(loading_class = ifelse(seq_along(pca_loadings) <= 21, "F1 * -1", "princomp"),
loading_class = factor(loading_class, levels = c("princomp", "F1 * -1"))) %>%
rename(factor = Var2,
bond_category = Var1) %>%
select(bond_category, factor, loading_class, pca_loadings)
pca_loadings %>%
ggplot(aes(bond_category, pca_loadings, colour = factor, group = factor)) +
geom_line(size = 1.5) +
facet_wrap(facets = "loading_class", nrow = 2) +
scale_y_continuous(breaks = seq(-1, 1, .3)) +
labs(title = "PCA loadings by factor across all bond categories using princomp",
caption = "Source: Course Project Treasury Data")
This also holds true for the factor scores plot.
factor_scores <- melt(bond_pca$scores[,1:3],
variable.name = "factor",
value.name = "factor_scores")
factor_scores <- factor_scores %>%
mutate(factor_scores = ifelse(Var2 == "Comp.1", factor_scores * -1, factor_scores)) %>%
bind_rows(factor_scores) %>%
select(-Var1)
factor_scores %>%
rename(factor = Var2) %>%
mutate(loading_class = rep(c("F1 * -1", "princomp"), each = 24900),
loading_class = factor(loading_class, levels = c("F1 * -1", "princomp")),
index = rep(seq(pca_factors$F1), 6)) %>%
ggplot(aes(index, factor_scores, colour = factor, group = factor)) +
scale_x_continuous(breaks = seq(0, 8500, 1000)) +
geom_line(size = 1.5) +
facet_wrap(facets = "loading_class", nrow = 2, scales = "free_y") +
labs(title = "PCA factor scores across all samples from princomp",
caption = "Source: Course Project Treasury Data")
Put simply, Output1 is the factor scores from F1 as derived from the PCA analysis. This means it is essentially a meta-feature compiled from all the original bond category data: a linear combination of the predictors. Output1 captures about 98% of the information from all the yield rates, making it a very good indicator of how all the bond categories move in one condensed variable. Given all the yields are so closely correlated, PCA is a very reasonable approach here. From a practical point of view, this would be an excellent addition to any finance-based analysis that called for using the Treasury yields, because it captures so much rate information while also providing dimension reduction. That said, the models developed throughout the analysis aren’t really useful for wider predictions since they gauge the predictors’ relationship to themselves.
output_uncovered <- data.frame(
value_composition = rep(c("Output1 from data", "Manual Factor 1", "Princomp F1"),
each = 8300),
values = as.vector(c(treasury_data$Output1, pca_factors[,3], bond_pca$scores[,1] * -1))
)
output_uncovered %>%
mutate(index = rep(seq(treasury_data$Output1), 3)) %>%
ggplot(aes(index, values, colour = value_composition)) +
geom_line(aes(size = value_composition), alpha = .65) +
scale_size_manual(values = c(3, 2, 1)) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
scale_colour_manual(values = c("gray", "darkorange", "royalblue2")) +
labs(title = "Output1, manual factor 1 scores, and princomp factor 1 scores- plot shows perfect fit between all three groups",
subtitle = "Output1 is the factor scores from F1 as derived from the PCA analysis",
caption = "Source: Course Project Treasury Data")
This table clearly highlights that the slope and intercept coefficients from the switched regression (with Output1 as the predictor and each bond category as the outcome variable) are equal to the PCA loadings and centres (the variable means).
data.frame(
t(apply(treasury_data[,2:8], 2,
function(x) summary(lm(x ~ Output1, treasury_data))$coefficients[1:2])),
pca_centres = bond_pca$center,
pca_loadings = bond_pca$loadings[,1] * -1) %>%
rename(regression_intercept = X1,
regression_slope = X2) %>%
custom_kable()
| regression_intercept | regression_slope | pca_centres | pca_loadings | |
|---|---|---|---|---|
| USGG3M | 4.675134 | 0.3839609 | 4.675134 | 0.3839609 |
| USGG6M | 4.844369 | 0.3901870 | 4.844369 | 0.3901870 |
| USGG2YR | 5.438888 | 0.4151851 | 5.438888 | 0.4151851 |
| USGG3YR | 5.644458 | 0.4063541 | 5.644458 | 0.4063541 |
| USGG5YR | 6.009421 | 0.3860610 | 6.009421 | 0.3860610 |
| USGG10YR | 6.481316 | 0.3477544 | 6.481316 | 0.3477544 |
| USGG30YR | 6.869355 | 0.3047124 | 6.869355 | 0.3047124 |
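This equality is structural rather than coincidental: the slope from regressing a variable on the first factor scores is that variable’s entry in the first eigenvector, and because the scores have mean zero, the intercept is the variable’s mean. A sketch on synthetic data:

```r
# Slope of x ~ f1 equals the PC1 loading; intercept equals the column mean
set.seed(7)
z <- rnorm(100)
X <- cbind(z + rnorm(100, sd = .2), 2 * z + rnorm(100, sd = .2)) + 5
V <- eigen(cov(X))$vectors
f1 <- as.vector(scale(X, scale = FALSE) %*% V[, 1])   # first factor scores
fit <- lm(X[, 1] ~ f1)
c(slope = unname(coef(fit)[2]), loading = V[1, 1],
  intercept = unname(coef(fit)[1]), column_mean = mean(X[, 1]))
```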
Using the centred matrix I constructed earlier, this line of inquiry can be reviewed.
c(dim = dim(centred_matrix)) %>%
custom_kable()
| dim1 | 8300 |
| dim2 | 7 |
The same process as before can be applied, which shows that the original model (Output1 ~ predictors) has slope coefficients equal to the slope coefficients of the centred matrix linear model.
centred_matrix <- as.data.frame(centred_matrix) %>%
mutate(Output1 = treasury_data$Output1)
t(data.frame(
original_lm = t(apply(treasury_data[,2:8], 2,
function(x) summary(lm(Output1 ~ x, treasury_data))$coefficients[2])),
centred_lm = t(apply(centred_matrix, 2,
function(x) summary(lm(Output1 ~ x, centred_matrix))$coefficients[2]))))%>%
custom_kable()
| original_lm.USGG3M | 2.507561 |
| original_lm.USGG6M | 2.497235 |
| original_lm.USGG2YR | 2.400449 |
| original_lm.USGG3YR | 2.455793 |
| original_lm.USGG5YR | 2.568742 |
| original_lm.USGG10YR | 2.786991 |
| original_lm.USGG30YR | 3.069561 |
| centred_lm.USGG3M | 2.507561 |
| centred_lm.USGG6M | 2.497235 |
| centred_lm.USGG2YR | 2.400449 |
| centred_lm.USGG3YR | 2.455793 |
| centred_lm.USGG5YR | 2.568742 |
| centred_lm.USGG10YR | 2.786991 |
| centred_lm.USGG30YR | 3.069561 |
| centred_lm.Output1 | 1.000000 |
Building on this, the centred matrix slope coefficients are equal to the first principal component’s loadings.
data.frame(
loadings = bond_pca$loadings[,1] * -1,
from_model = t(lm(Output1 ~ ., data = centred_matrix)$coef)[-1]) %>%
custom_kable()
| loadings | from_model | |
|---|---|---|
| USGG3M | 0.3839609 | 0.3839609 |
| USGG6M | 0.3901870 | 0.3901870 |
| USGG2YR | 0.4151851 | 0.4151851 |
| USGG3YR | 0.4063541 | 0.4063541 |
| USGG5YR | 0.3860610 | 0.3860610 |
| USGG10YR | 0.3477544 | 0.3477544 |
| USGG30YR | 0.3047124 | 0.3047124 |
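The same logic extends to the multiple regression: because the first factor scores are an exact linear combination of the centred predictors, the fitted coefficients recover the entire first eigenvector. A sketch on synthetic data:

```r
# Multiple regression of f1 on all centred predictors returns the eigenvector
set.seed(8)
z <- rnorm(100)
X <- cbind(z + rnorm(100, sd = .2),
           2 * z + rnorm(100, sd = .2),
           3 * z + rnorm(100, sd = .2))
V <- eigen(cov(X))$vectors
Xc <- scale(X, scale = FALSE)
f1 <- as.vector(Xc %*% V[, 1])       # exact linear combination of Xc's columns
fit <- lm(f1 ~ Xc)
max(abs(unname(coef(fit)[-1]) - V[, 1]))   # essentially zero
```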